Article

TeaPickingNet: Towards Robust Recognition of Fine-Grained Picking Actions in Tea Gardens Using an Attention-Enhanced Framework

1 College of Smart Agriculture (College of Artificial Intelligence), Nanjing Agricultural University, Nanjing 210095, China
2 Lincoln Institute for Agri-Food Technology, University of Lincoln, Lincoln LN6 7TS, UK
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(23), 2441; https://doi.org/10.3390/agriculture15232441
Submission received: 21 October 2025 / Revised: 6 November 2025 / Accepted: 19 November 2025 / Published: 26 November 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

With the emergence of smart agriculture, precise behavior recognition in tea gardens has become increasingly important for operational monitoring and intelligent management. However, existing behavior detection systems face limitations when deployed in real-world agricultural environments, particularly due to dense vegetation, variable lighting, and diverse human–machine interactions. This paper proposes a novel deep learning framework for picking behavior recognition tailored to complex tea plantation environments. We first construct a large-scale, annotated dataset comprising 12,195 images across 7 behavior categories, collected from both field and web sources and capturing a diverse range of geographic, temporal, and environmental conditions. To address occlusion and multi-scale detection challenges, we enhance YOLOv5 by integrating an Efficient Multi-scale Attention (EMA) mechanism, Complete Intersection over Union (CIoU) loss, and Atrous Spatial Pyramid Pooling (ASPP), achieving 73.6% mAP@0.5, an 11.6% relative improvement over the baseline model and a notable gain in detection accuracy under complex tea garden conditions. Furthermore, we propose an SE-Faster R-CNN model by embedding Squeeze-and-Excitation (SE) channel attention modules and anchor box optimization strategies, which significantly boosts performance in complex scenarios. A lightweight visual interface for real-time image- and video-based detection is also developed to enhance the practical deployability of the system. Experimental results demonstrate the effectiveness, robustness, and real-time potential of the proposed system in recognizing tea garden behaviors under real field conditions.

1. Introduction

Tea cultivation has long been a cornerstone of China’s agricultural economy and cultural heritage. With growing global interest in ecological farming and rural tourism, tea gardens are increasingly adopting smart technologies to enhance efficiency and sustainability [1,2]. A key aspect of intelligent management is picking behavior recognition, which supports high-quality harvesting, labor monitoring, and prevention of unauthorized activities.
While traditional picking methods have been practiced for millennia, they struggle to meet modern demands, particularly in activity monitoring and precise localization [3,4,5]. Advances in technologies such as the Internet of Things (IoT), big data, cloud computing, and artificial intelligence (AI) are transforming agriculture [6,7,8]. In particular, AI and deep learning offer new opportunities for smart tea gardens, enabling effective solutions for behavior recognition and localization. Table 1 summarizes the significance of picking behavior detection in tea garden scenarios.
Figure 1 illustrates China's annual tea production (in units of 10,000 tons) and the corresponding growth rate (%) from 2013 to 2023. This dual-axis graph provides a comprehensive view of both the volume of tea produced and the rate at which production increased or decreased over the years. In recent years, China's tea production has shown a steady growth trend: according to the chart data, it increased from 180.21 (about 1.80 million tons) in 2013 to 354.12 (about 3.54 million tons) in 2023, almost doubling within ten years. This remarkable growth reflects the booming development of China's tea industry and its important contribution to the global tea market.
However, the growth rate varied from year to year, peaking at 11.59% in 2015 and turning negative (−1.61%) in 2016; such fluctuations likely reflect climate variability, changes in cultivation techniques, and shifts in market demand.
In agriculture, deep learning and image processing are primarily applied to crop weed identification, pest and disease monitoring, and livestock behavior analysis. For weed detection, ref. [9] combined monocular visual features with plant height in an SVM model for field classification. Ref. [10] improved accuracy to 85.8% in corn fields by integrating spectral bands into a Bayesian discriminant model. Ref. [11] developed a dual-classification system using enhanced Canny edge detection and a BP neural network, achieving high accuracy for both broad-leaved and gramineous weeds [12].
In pest and disease monitoring, ref. [13] demonstrated that image segmentation boosts tomato leaf disease classification accuracy from 42.3% to 98.6%, highlighting the importance of lesion localization. Subsequent studies include: ref. [14] enabling real-time wheat disease detection on mobile devices via traditional segmentation and statistical inference; ref. [15] developing a DCNN for cucumber disease identification; and [16,17] proposing a multi-task system for coffee leaf stress classification and severity assessment, successfully deployed on mobile platforms.
In livestock and poultry behavior monitoring, refs. [18,19] achieved high-precision identification of sow behaviors based on voiceprint feature clustering and support vector machines; refs. [20,21] obtained a 92% accuracy rate in laying-hen behavior recognition through frequency-domain conversion of sound signals and neural network training; ref. [22] fused a Kalman filter tracking algorithm (KF Tracking) with the GoogLeNet network to realize real-time analysis of pig behaviors; and refs. [23,24] constructed a chicken health monitoring framework based on image segmentation and deep learning algorithms.
Although intelligent applications in tea gardens are still limited, related research in smart orchards primarily focuses on fruit tree pest and disease detection, as well as recognition of floral parts such as flowers, buds, and fruits [25,26]. In tea garden scenarios, deep learning and image processing are mainly applied to picker behavior analysis, supporting optimized management and harvesting practices.
To trace the evolution of computer vision in agricultural human activity recognition, this study presents a systematic review of recent representative works (Table 2). The analysis reveals a clear trend: from static image analysis to dynamic video understanding, from single-modality to multi-source data fusion, and toward models balancing accuracy and efficiency.
Early studies (e.g., [27,28,29]) focused on plant disease and weed detection in static images using SVM, random forests, and basic CNNs, establishing the foundation of agricultural image analysis. Later, complexity increased: ref. [30] introduced multi-task learning, and [31] employed CNN-LSTM architectures, marking a shift toward temporal behavior recognition and severity estimation in dynamic scenarios.
Recent research emphasizes lightweight models and multi-source fusion. Works [30,32] reduce model size and computation while maintaining accuracy for edge deployment, and [31] demonstrates that fusing multispectral and thermal infrared data enhances performance in complex farmland conditions.
However, existing methods remain inadequate for fine-grained, continuous behaviors like tea plucking. Most approaches (e.g., [33,34,35]) rely on general-purpose datasets and struggle with tea garden challenges such as occlusion, small targets (e.g., fingers, buds), and variable illumination. While [36] employs multimodal sensing in tea gardens, it focuses on drought stress, not human activity recognition.
To address this gap, we propose TeaPickingNet—a dedicated deep learning framework for tea plucking behavior recognition, inspired by [37,38,39]—integrating efficiency and robustness for real-world agricultural deployment.
This study focuses on designing a robust, deep learning framework specifically for recognizing human and mechanical picking behaviors in tea garden scenarios. The core contributions of this work are:
  • Construction and open-sourcing of a novel, large-scale, labeled dataset tailored to tea garden behavior recognition.
  • Enhancement of the YOLOv5 model using attention and context-aware modules for high-speed, high-accuracy object detection.
  • Design of a channel-attentive SE-Faster R-CNN model to better handle occluded and multi-scale object features.
Table 2. Summary of representative studies on agricultural behavior and object detection.

Ref. | Year | Data Source | Main Task | Main Techniques | Key Findings
[27] | 2016 | Binocular images of maize and weeds (2–5 leaf stage) | Weed recognition for precision weeding | SVM with fused height and monocular image features; feature optimization via GA and PSO | Achieved 98.3% accuracy; fusion of height and image features improved weed discrimination by 5%
[28] | 2017 | 3637 wheat disease images (2014–2016 field captures from Spain/Germany) | Early detection of wheat diseases (septoria, rust, tan spot) using mobile devices | Hierarchical algorithm with: (1) color constancy normalization; (2) SLIC superpixel segmentation; (3) hot-spot detection (Naive Bayes); (4) LBP texture features; (5) Random Forest classification | Achieved AUC > 0.8 for all diseases in field tests; maintained performance across 7 mobile devices and variable lighting conditions
[35] | 2018 | 1184 cucumber disease images (field captures + PlantVillage/ForestryImages) | Weed recognition for precision weeding | SVM with fused height and monocular image features; feature optimization via GA and PSO | Achieved 98.3% accuracy; fusion of height and image features improved weed–crop discrimination by 5%
[37] | 2019 | 1747 coffee leaf images (4 biotic stresses) | Multi-task classification and severity estimation | (1) Modified ResNet50 with parallel FC layers; (2) Mixup data augmentation; (3) t-SNE feature visualization | Achieved 94.05% classification accuracy and 84.76% severity estimation; multi-task learning reduced training time by 40% compared to single-task models
[29] | 2020 | PlantVillage, farm images, Internet images | Improve plant disease detection accuracy using segmented images | Segmented CNN (S-CNN) with image segmentation; full-image CNN (F-CNN) for comparison; CNN architecture with convolution/pooling layers | S-CNN achieves 98.6% accuracy (vs. 42.3% for F-CNN) on independent test data; segmentation improves model confidence by 23.6% on average
[33] | 2021 | UCF-101 dataset (13,320 videos) | Human action recognition using transfer learning | (1) CNN-LSTM; (2) transfer learning from ImageNet; (3) two-stream spatial-temporal fusion; (4) data augmentation techniques | (1) Best architecture achieved 94.87% accuracy; (2) transfer learning improved performance by 4–7%; (3) two-stream models outperformed single-stream; (4) VGG16 performed best among pre-trained models
[34] | 2022 | Field studies from Chinese tea plantations | Review effects of intercropping on tea plantation ecosystems | Meta-analysis of 88 studies on tea intercropping systems | Intercropping improves microclimate (+15% temperature stability), soil nutrients (+50% SOM), and tea quality (+20% amino acids) while reducing pests (−30%)
[39] | 2022 | UCF101, HMDB51, Kinetics-100, SSV2 datasets | Few-shot action recognition | (1) Prototype Aggregation Adaptive Loss (PAAL); (2) Cross-Enhanced Prototype (CEP); (3) Dynamic Temporal Transformation (DTT); (4) Reweighted Similarity Attention (RSA) | (1) Achieves 5-way 5-shot accuracy of 75.1% on HMDB51; (2) outperforms TRX by 1.2% with fewer parameters; (3) adaptive margin improves performance by 0.6%; (4) temporal transformation enhances generalization
[38] | 2023 | 15 benchmark video datasets (NTU RGB+D, Kinetics, etc.) | Comprehensive review of deep learning for video action recognition | (1) Taxonomy of 9 architecture types; (2) performance comparison of 48 models; (3) temporal analysis frameworks | Identified 3D CNNs and hybrid architectures as most effective, with ViT-based models showing 12% accuracy gains over CNN baselines
[30] | 2024 | ETHZ Food-101, Vireo Food-172, UEC Food-256, ISIA Food-500 | Develop lightweight food image recognition model | Global Shuffle Convolution; parallel block structure (local + global convolution); reduced network layers | Achieves SOTA performance with fewer parameters (28% reduction) and FLOPs (37% reduction) while improving accuracy by 0.7% compared to MobileViTv2
[31] | 2024 | FF++, DFDC, CelebDF, HFF datasets | Lightweight face forgery detection | (1) Spatial Group-wise Enhance (SGE) module; (2) TraceBlock for global texture extraction; (3) Dynamic Fusion Mechanism (DFM); (4) dual-branch architecture | (1) Achieves 94.87% accuracy with only 963k parameters; (2) outperforms large models while being 10× smaller; (3) processes 236 fps for real-time detection; (4) effective cross-dataset generalization
[32] | 2025 | 4 pest datasets (D1-5869, D1-500, D2-1599, D2-545) | Optimal integration of attention mechanisms with MobileNetV2 for pest recognition | (1) Systematic attention integration framework; (2) 6 attention variants; (3) layer-wise optimization; (4) performance-efficiency tradeoff analysis | BAM12 configuration achieved 96.7% accuracy (D1-5869) with only 3.8M parameters; MobileViT showed poorest performance (83% accuracy); attention maps revealed improved pest localization
[36] | 2025 | UAV-based MS, thermal infrared, and RGB imagery | Drought stress classification in tea plantations | Multi-source data fusion (MS + TIR); improved GA-BP neural network (RSDCM) | Multi-source fusion (MS + TIR) outperformed single-source data; RSDCM achieved the highest accuracy (98.3%) and strong generalization across all drought levels

2. Materials and Methods

Recognizing picking and related behaviors in tea garden scenarios relies primarily on object detection, with further improvements built on top of it regarding human posture and the spatial relationship between pickers and tea plants. This section therefore introduces the mainstream object detection algorithms and optimization modules selected for this work, describes their structures and characteristics, and evaluates their advantages for recognizing picking and related behaviors in tea garden scenarios.

2.1. Dataset Introduction

As one of China’s key economic crops, the tea industry is undergoing a necessary transformation toward mechanization and intelligent automation, aligning with the broader trends of modern agricultural development. In this transition, innovative applications of AI-based recognition technologies have provided robust technical support for large-scale and diversified tea harvesting operations—yet such advancements require a specialized dataset tailored to tea garden environments [40,41]. To address this need, this study presents a comprehensive dataset of human picking behaviors in tea garden scenarios. Table 3 summarizes the key specifications and characteristics of the developed dataset. The proposed dataset demonstrates significant advantages in three key aspects:
  • Enhanced data diversity and model robustness: Through data augmentation techniques such as rotation, cropping, contrast adjustment, and flipping, the dataset significantly improves sample diversity, thereby enhancing the model’s robustness in complex and variable field conditions.
  • Precise and comprehensive annotations: The dataset includes accurate annotations such as bounding box coordinates, category labels, and object size parameters. These detailed annotations provide a solid foundation for effective feature extraction and model performance optimization.
  • Support for multi-scale training: The dataset enables multi-scale training strategies, which effectively improve the model’s accuracy in detecting targets at varying distances and sizes, ensuring strong adaptability in real-world deployment scenarios.
Table 3. Information related to the dataset.

Category | Details
Research Field | Behavior recognition in tea garden scenarios
Data Type | Raw data: a total of 12,195 images saved in "jpg" format; the annotated labels of the raw data are summarized in "xlsx" format
Data Source | (1) Public video data, collected via web scraping from mainstream platforms such as Baidu's Haokan Video, Xigua Video, and Bilibili, mostly related to tea picking and sliced into individual frames; (2) on-site filming at tea gardens in the Yuhuatai area of Nanjing
Accessibility | Publicly available on IEEE DataPort: https://doi.org/10.21227/dnkh-8e73 (see the Data Availability Statement)
This paper proposes a comprehensive technical framework for tea garden picking behavior recognition, as illustrated in Figure 2. The framework encompasses three main stages: data acquisition, agricultural data processing, and the deep learning framework TeaPickingNet.
In the data acquisition stage, a multi-source data collection strategy is adopted. On one hand, public video data related to tea picking is obtained from mainstream video platforms such as Baidu’s Haokan Video, Xigua Video, and Bilibili through web scraping technology, and these videos are sliced into individual frames. On the other hand, field collection is conducted at the Huanglongxian tea garden in the Yuhua District of Nanjing to supplement the dataset. The overall data situation shows that the dataset for identifying picking behaviors in tea gardens has a total of 7 categories and 12,195 images, laying a solid foundation for subsequent research.
In the data processing stage, a series of preprocessing operations are carried out. First, data filtering is performed to remove noise and irrelevant information from the original image dataset. Then, data augmentation techniques are used to enrich the diversity of the dataset and improve the generalization ability of the model. Finally, data annotation is conducted using the Labelme tool to provide accurate label information for model training.
The deep learning framework TeaPickingNet is the core of this technical framework. It includes two main branches: one branch focuses on improving the YOLOv5 algorithm, including the introduction of attention mechanisms (EMA, Efficient Multi-scale Attention; SE, Squeeze-and-Excitation; and SA, Self-Attention) and the optimization of loss functions (GIoU, Generalized Intersection over Union; SIoU, Structural Intersection over Union; and CIoU, Complete Intersection over Union). The other branch is dedicated to optimizing the SE-Faster R-CNN algorithm, including feature fusion and anchor box re-setting. Through a series of ablation experiments and comparative analyses, including attention mechanism ablations, loss function ablations, training efficiency and loss convergence analyses, and detection accuracy comparisons, the performance of the improved models is comprehensively evaluated. The results show that the improved models have significant advantages in tea garden picking behavior recognition tasks, providing effective technical support for the automatic recognition of picking behaviors in tea gardens.

2.1.1. Data Collection

This work constructs a tea garden harvesting behavior recognition dataset, the labels of which consist of a total of seven categories: hand-picking, machine harvesting, walking, standing, conversing, storage tools, and harvesting tools. The video scenes encompass various terrains such as high mountains, hills, and plains, as well as diverse weather conditions including clear days, foggy days, and rainy days. The image and video resolutions include formats such as 480p, 720p, and 1080p, with video clip dimensions including 1920 × 1280, 854 × 480, and 852 × 480. Considering that some original image frames do not contain any of the aforementioned categories, such frames could adversely affect the final judgment. Therefore, during the video frame extraction process, we deleted certain frames prone to errors while striving to maintain video continuity, and subsequently sorted the annotated images to form the original dataset.
The entire dataset was randomly divided into training, validation, and testing subsets in a ratio of 70%:15%:15% at the video level to avoid overlapping frames from the same source in different subsets. This ensures that all frames from one video sequence belong exclusively to a single subset, thereby maintaining independence between training and testing data.
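As an illustration of this protocol, a minimal video-level split could look like the sketch below. The frame naming convention (video ID and timestamp separated by an underscore) follows the annotation format described in Section 2.1.4, while the directory layout and the assumption that the video ID contains no underscores are hypothetical:

```python
import random
from collections import defaultdict
from pathlib import Path

def split_by_video(frame_dir, ratios=(0.70, 0.15, 0.15), seed=42):
    """Group frames by their source video, then assign whole videos to splits."""
    frames_per_video = defaultdict(list)
    for frame in Path(frame_dir).glob("*.jpg"):
        video_id = frame.stem.split("_")[0]  # assumes "videoID_timestamp.jpg" naming
        frames_per_video[video_id].append(frame)

    video_ids = sorted(frames_per_video)
    random.Random(seed).shuffle(video_ids)

    n = len(video_ids)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    split_ids = {
        "train": video_ids[:n_train],
        "val": video_ids[n_train:n_train + n_val],
        "test": video_ids[n_train + n_val:],
    }
    # Every frame of a given video ends up in exactly one subset.
    return {name: [f for vid in vids for f in frames_per_video[vid]]
            for name, vids in split_ids.items()}
```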
It is worth noting that all raw data in this study originate from continuous video recordings rather than isolated images. During dataset construction, a uniform frame sampling strategy was adopted to slice videos into still frames while preserving their temporal continuity. Each extracted frame thus retains representative visual states of the picking sequence, ensuring that temporal correlations among consecutive actions (e.g., reaching, grasping, and plucking) are partially embedded in the sampled dataset. This design allows the static detection framework to approximate dynamic picking processes without requiring sequential modeling.
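As a purely illustrative sketch (the exact sampling interval is not reported in the paper), uniform frame sampling of a source video can be implemented with OpenCV as follows:

```python
import cv2

def sample_frames(video_path, out_dir, step=30):
    """Save every `step`-th frame, preserving the temporal order of the picking sequence."""
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```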

2.1.2. Data Filtering

The raw image dataset undergoes further filtering. As the dataset primarily contains labels requiring clear and complete visual identification—hand-picking, machine harvesting, walking, standing, conversing, storage tools, and harvesting tools—frames exhibiting blur, low quality, unclear targets, or incomplete objects, which are common in video-derived slices, are removed during filtering to eliminate ambiguous instances.

2.1.3. Data Augmentation

Following video clipping and data filtering, the number of instances for certain behaviors and tools is inevitably reduced, particularly for web-sourced data. To balance the distribution across categories, mitigate model overfitting towards specific actions or tools, enhance dataset diversity, and improve model robustness, speed, and accuracy in real-world applications, data augmentation is performed. Four fundamental techniques are applied: rotation, flipping, contrast enhancement, and cropping [42,43]. These augmentations are randomly applied to each original image, effectively multiplying the dataset size.
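A minimal sketch of these four augmentations using torchvision is shown below. The specific parameter values (rotation angle, contrast range, crop scale) are assumptions, since the exact settings are not reported; for detection data, the bounding boxes would additionally need to be transformed consistently with the image, e.g., with a box-aware augmentation library.

```python
from torchvision import transforms

# Randomly applied rotation, flipping, contrast adjustment, and cropping.
augment = transforms.Compose([
    transforms.RandomApply([transforms.RandomRotation(degrees=15)], p=0.5),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.4)], p=0.5),
    transforms.RandomResizedCrop(size=640, scale=(0.8, 1.0)),
])
```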

2.1.4. Data Annotation

The tea garden harvesting behavior dataset is annotated following strict format guidelines: (1) Video processing: Video data is processed using FFmpeg, with clips renamed sequentially (format: video_id_timestamp) and stored in respective directories. (2) Labeling: Labeling is performed using Labelme, generating JSON annotations. Label order is strictly maintained to prevent confusion. Scripts subsequently convert the JSON annotations into COCO, YOLO, and VOC formats suitable for model training. The dataset includes single-label annotations for the specified behaviors and tools. (3) Validation: A final review by professional tea harvesters verifies the accuracy and realism of the annotations.
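As an illustration of step (2), a minimal sketch of converting one Labelme rectangle annotation into a YOLO-format label file might look as follows. The class list matches the seven categories of this dataset, while the exact label spellings and file layout are assumptions:

```python
import json
from pathlib import Path

CLASSES = ["hand-picking", "machine harvesting", "walking", "standing",
           "conversing", "storage tools", "harvesting tools"]

def labelme_to_yolo(json_path, out_dir):
    """Convert Labelme rectangle shapes to YOLO 'cls cx cy w h' lines (normalized)."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    img_w, img_h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        (x1, y1), (x2, y2) = shape["points"]  # rectangle: two corner points
        x_min, x_max = sorted((x1, x2))
        y_min, y_max = sorted((y1, y2))
        cx = (x_min + x_max) / 2 / img_w
        cy = (y_min + y_max) / 2 / img_h
        w = (x_max - x_min) / img_w
        h = (y_max - y_min) / img_h
        cls = CLASSES.index(shape["label"])
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    Path(out_dir, Path(json_path).stem + ".txt").write_text("\n".join(lines))
```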
The composition of the tea garden picking behavior recognition dataset is shown in Table 4. There are 128 videos in total, comprising 12,195 key frames and 18,468 labeled boxes. In addition, to ensure the universality and compatibility of the dataset, this paper crawls tea garden picking videos from the Internet covering multiple scenarios, including different angles (overhead, upward, and straight-on views), different weather conditions (cloudy, sunny, foggy, and rainy), different picking methods (manual and machine), different numbers of people (single person and multiple people), and different distances (long shot, close-up, and hand close-up). The scenarios cover various parts of China, including western plains and eastern tea gardens, and the data come from 138 different video scenes. The richness of shooting angles and distances is intended to support possible mobile deployment in the future. Table 4 lists the classification and slicing of the dataset.

2.2. Methods

The operating system used in the experiments of this paper is Windows 11 Version 23H2, and the experimental environment is built on the PyTorch 1.9.0 framework. The hardware platform is configured with an Intel Xeon(R) Platinum 8362 CPU and an NVIDIA GeForce RTX 3090 GPU with 24 GB memory. The parameter settings are shown in Table 5.

2.2.1. Improvement of Picking Behavior Recognition Model Based on YOLO

Based on the YOLOv5 network, this paper addresses the problem of detection performance interference caused by complex backgrounds in tea garden scenarios by introducing an attention mechanism, which effectively extracts key information related to picking behaviors and thereby reduces the interference of background noise. In addition, to enhance the model’s ability to detect multi-scale targets, the Atrous Spatial Pyramid Pooling (ASPP) module is incorporated [44]. This module enables flexible multi-scale feature extraction through customizable dilation rates, further expanding the model’s receptive field. Meanwhile, the loss function is optimized by introducing the Complete Intersection over Union (CIoU) loss, which accelerates the model’s convergence speed. Figure 3 presents the structure diagram of the improved YOLOv5s network proposed in this paper.
(1)
Efficient Multi-scale Attention (EMA)
The EMA mechanism adopts a dual-path multi-scale interaction architecture, enabling dynamic feature optimization through collaborative learning of global perception and local enhancement, with its core structure illustrated in Figure 4. Input features are first grouped for parallel processing. The upper part constitutes the global perception path: for each group of features, average pooling is performed along the X and Y axes (X/Y Avg Pool) respectively [45,46]. After extracting multi-scale direction-sensitive features via 3 × 3 convolution (Conv (3 × 3)), the features are concatenated and the channel dimension is compressed through 1 × 1 convolution (Conv (1 × 1)) to generate global channel weights. The lower part serves as the local enhancement path: the grouped features undergo successive GroupNorm and spatial pooling (Avg Pool), and a cross-spatial pixel correlation matrix is computed using a dual-branch Softmax (Cross-spatial learning). Local detail responses are then strengthened through matrix multiplication (Matmul).
Figure 3. Improved YOLOv5 network structure diagram.
Figure 4. The structure of the EMA attention mechanism.
The outputs of the dual paths are adaptively weighted and summed (at the “+” node), followed by Sigmoid activation, which dynamically fuses global semantics and local spatial information (Re-weight module) to ultimately generate joint spatial-channel attention weights. The design of the EMA mechanism—characterized by grouped lightweight computation, multi-scale directional perception (X/Y Avg Pool), and cross-spatial correlation modeling (Matmul)—effectively enhances the discriminability and robustness of feature representation in complex scenarios. This characteristic makes EMA well-suited for tasks like tea garden picking behavior recognition with intricate backgrounds, offering inherent advantages in addressing practical challenges such as target occlusion and object size variations.
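For reference, a compact PyTorch sketch of the EMA module, following the publicly available Efficient Multi-scale Attention formulation, is given below. The grouping factor is an illustrative hyperparameter and not necessarily the exact configuration used in this paper:

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention: grouped dual-path (1x1 / 3x3) attention
    with cross-spatial learning, as described above."""
    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d(1)              # spatial average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average pooling along the X axis
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average pooling along the Y axis
        self.gn = nn.GroupNorm(channels // groups, channels // groups)
        self.conv1x1 = nn.Conv2d(channels // groups, channels // groups, kernel_size=1)
        self.conv3x3 = nn.Conv2d(channels // groups, channels // groups, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)                       # channel grouping
        # Directional (X/Y) pooling, shared 1x1 conv, then re-weighting of the group
        x_h = self.pool_h(g)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)
        x_h, x_w = torch.split(self.conv1x1(torch.cat([x_h, x_w], dim=2)), [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(g)                                           # local 3x3 branch
        # Cross-spatial learning: pooled descriptors of one branch attend to the other
        x11 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.groups, c // self.groups, -1)
        x21 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x22 = x1.reshape(b * self.groups, c // self.groups, -1)
        weights = (x11 @ x12 + x21 @ x22).reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, c, h, w)             # re-weighted output
```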
(2)
The Atrous Spatial Pyramid Pooling (ASPP) Module
To enhance the efficiency of multi-scale feature extraction, the Atrous Spatial Pyramid Pooling (ASPP) module features atrous convolutions with adjustable receptive fields. By setting different dilation rates (e.g., 2, 4, 8), the atrous convolution mechanism can effectively expand the coverage of feature extraction while maintaining the 3 × 3 convolution kernel size. This design enables the network to capture both subtle changes in hand movements during tea garden picking and overall action characteristics. In the global feature extraction stage of the ASPP module, adaptive average pooling is employed. The specific process involves three steps: first, compressing features of each channel into a 1 × 1 size; second, reconstructing features using 1 × 1 convolution; third, restoring spatial dimensions through upsampling and fusing them with other features. This structural design retains global background information while ensuring the ability to express detailed features.
The integration of the ASPP module into the YOLOv5 model is primarily motivated by the complex spatial interactions between pickers and tea plants in tea garden environments. Multi-scale feature extraction can better handle differences in target sizes; meanwhile, a larger receptive field helps capture partially occluded picking actions when tea farmers’ hand movements are obscured by branches and leaves. With its pyramid-based feature fusion mechanism, the ASPP module can simultaneously recognize fine hand gestures and overall postures. Thus, incorporating the ASPP module into the original YOLOv5 network structure can better improve accuracy for tea garden scenarios. The structure of the ASPP module is shown in Figure 5.
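A minimal PyTorch sketch of an ASPP block with the dilation rates mentioned above (2, 4, 8) and the global pooling branch is given below; the channel sizes and the final fusion projection are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel atrous 3x3 convolutions with
    different dilation rates plus a global-pooling branch."""
    def __init__(self, in_ch, out_ch, rates=(2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)] +                      # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)  # atrous branches
             for r in rates]
        )
        self.global_branch = nn.Sequential(   # compress to 1x1, re-project, upsample later
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        g = F.interpolate(self.global_branch(x), size=(h, w),
                          mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))   # fuse multi-scale features
```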
(3)
Loss Function Optimization
In object detection tasks, the YOLOv5 model employs Complete Intersection over Union (CIoU) as the default bounding box regression loss function. CIoU comprehensively considers three key geometric factors: the intersection-over-union (IoU) of overlapping regions, the Euclidean distance between center coordinates, and the matching degree of width-height ratios. Other commonly used effective loss functions include Generalized IoU (GIoU), an improved bounding box regression loss function that addresses the limitations of traditional IoU by introducing the smallest enclosing convex hull region. GIoU primarily focuses on two geometric factors: the IoU of overlapping regions and the difference between the minimum enclosing rectangle of the predicted and ground-truth boxes. Structural IoU (SIoU) loss is an innovative bounding box regression loss function that mitigates the flaws of traditional IoU by incorporating an angle-aware mechanism. It comprehensively accounts for four critical geometric elements: the IoU of overlapping regions, the distance between the centers of the two boxes, the similarity of angles, and the matching degree of width-height ratios. The mathematical expression of SIoU is given by (1):
$L_{SIoU} = 1 - \mathrm{IoU} + \dfrac{\Delta + \Omega}{2}$ (1)
where IoU measures the overlap degree; the angle cost term $\Delta = 1 - 2\sin^{2}\theta$ corrects directional deviations; and the distance cost term $\Omega$ integrates constraints on center distance and width-height ratio. In tea garden picking scenarios, SIoU exhibits significant advantages over CIoU and GIoU (a minimal loss-computation sketch is given after the list below):
  • The angle term effectively handles detection box rotation caused by the body tilt of the picker.
  • The distance term adapts to minor position differences among dense plants.
  • It maintains high sensitivity to partially occluded targets. Therefore, this experiment considers introducing SIoU to improve accuracy.
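For reference, the sketch below computes an SIoU-style bounding-box loss consistent with Equation (1): the angle cost (Δ) modulates the distance term, and the width-height (shape) term corresponds to part of Ω, as in the commonly used SIoU formulation. Constants such as the shape exponent are assumptions rather than settings reported in this paper:

```python
import math
import torch

def siou_loss(pred, target, eps=1e-7):
    """Sketch of an SIoU loss: 1 - IoU + (distance/angle cost + shape cost) / 2.
    Boxes are (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # Intersection over Union
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (wp * hp + wt * ht - inter + eps)

    # Center offsets and smallest enclosing box
    s_cw = (target[:, 0] + target[:, 2] - pred[:, 0] - pred[:, 2]) / 2
    s_ch = (target[:, 1] + target[:, 3] - pred[:, 1] - pred[:, 3]) / 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    sigma = torch.sqrt(s_cw ** 2 + s_ch ** 2) + eps

    # Angle cost (Delta) and the distance cost it modulates
    sin_a = torch.abs(s_cw) / sigma
    sin_b = torch.abs(s_ch) / sigma
    sin_alpha = torch.where(sin_a > math.sqrt(2) / 2, sin_b, sin_a)
    angle_cost = torch.cos(2 * torch.asin(sin_alpha) - math.pi / 2)
    gamma = angle_cost - 2
    dist_cost = 2 - torch.exp(gamma * (s_cw / (cw + eps)) ** 2) \
                  - torch.exp(gamma * (s_ch / (ch + eps)) ** 2)

    # Width-height (shape) cost, part of Omega
    shape_cost = (1 - torch.exp(-torch.abs(wp - wt) / (torch.max(wp, wt) + eps))) ** 4 \
               + (1 - torch.exp(-torch.abs(hp - ht) / (torch.max(hp, ht) + eps))) ** 4

    return 1 - iou + (dist_cost + shape_cost) / 2
```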

2.2.2. Algorithm Improvement Based on Faster R-CNN

To address the challenges of complex backgrounds, diverse target scales, and partial occlusions in tea garden picking behavior recognition, this section focuses on optimizing the Faster R-CNN algorithm. By integrating attention mechanisms, optimizing anchor box settings, and enhancing feature fusion strategies, the improved SE-Faster R-CNN model is developed to achieve more robust detection performance in specific agricultural scenarios.
The Faster R-CNN network was optimized into the SE-Faster R-CNN model to better adapt to behavior recognition in tea garden picking scenarios. First, anchor box settings in object detection were re-optimized using a clustering algorithm based on the distribution characteristics of annotated boxes in the dataset, making them more consistent with the actual target distribution in tea garden picking scenarios and thus improving detection accuracy and efficiency [47,48]. Second, the SE-NET attention mechanism was introduced to reduce interference from complex backgrounds, enabling the model to automatically learn and focus on key feature regions related to picking behaviors while suppressing useless background information, further enhancing detection performance.
In addition, to strengthen the feature network’s ability to understand multi-layer semantic features, fusion optimization was performed on the feature network. By reasonably designing the feature fusion strategy, the model can fully utilize feature information at different levels to achieve more accurate description of targets, thus improving the detection ability for small-sized targets in complex scenarios. The structure diagram of the improved Faster R-CNN network is shown in Figure 6.
(1)
The Region Proposal Network (RPN)
The Region Proposal Network (RPN) is a key component of two-stage detection frameworks and is primarily responsible for generating candidate regions of foreground targets. This module employs a flexible threshold-based judgment mechanism, distinguishing foreground from background based on the strength of semantic responses in feature maps. During feature extraction, shallow networks retain abundant spatial details, which are crucial for accurately localizing key regions such as hand movements in tea picking actions. To this end, this study designs a cross-layer feature fusion method in the RPN, integrating high-resolution shallow features with deep features rich in semantic information. This significantly improves the accuracy of target localization in complex tea garden environments.
Figure 6. The structure of the SE-Faster R-CNN network.
Furthermore, since the original anchor box design of standard Faster R-CNN is incompatible with the tea garden picking scenarios described in this paper, a novel anchor box optimization scheme based on target statistical features is proposed. By conducting clustering analysis on the geometric features of targets (e.g., pickers’ postures and tool specifications) in the dataset, a set of anchor box parameters more aligned with actual scenarios is reconstructed.
(2)
The Anchor Size Setting
The original Faster R-CNN adopts a multi-scale preset anchor strategy, combining three basic scales with three aspect ratios. However, its parameters are derived from general multi-category object detection datasets such as VOC and COCO. Experiments show that the size distribution of the preset anchor boxes deviates significantly from the actual target scales and does not match the tea garden behavior dataset used in this paper. To establish an anchor box generation mechanism matched to the target scale distribution, this study introduces an unsupervised clustering algorithm to statistically analyze the real sizes of the target bounding boxes in the dataset. Feature-space modeling of the annotated data is conducted using the K-Means clustering method. The re-set anchor box sizes are shown in Table 6.
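As an illustration, a plain K-Means clustering of the annotated box widths and heights could be sketched as follows (Euclidean distance is used here; YOLO-style anchor clustering often uses an IoU-based distance instead, and the anchor count and input format are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_sizes(boxes_wh, n_anchors=9, seed=0):
    """Cluster annotated box (width, height) pairs into dataset-specific anchor sizes.

    boxes_wh: (N, 2) array of ground-truth box widths and heights in pixels,
    e.g. extracted from the converted label files."""
    km = KMeans(n_clusters=n_anchors, random_state=seed, n_init=10).fit(boxes_wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]   # sort anchors by area
```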
(3)
Network Architecture Enhancement Based on The Attention Mechanism
In the field of computer vision, CNNs serve as a pivotal architecture for tasks such as object detection, with remarkable achievements attained. CNNs rely on the alternating operations of convolution, downsampling, and nonlinear activation, making them highly suitable for extracting key features from large volumes of data. However, constructing high-performance networks remains challenging, and one critical issue lies in optimizing the network’s ability to represent features. Recent studies have shown that adopting methods to enhance features in the spatial domain can effectively improve model performance. Examples include multi-scale feature fusion, spatial context modeling, and channel attention mechanisms. These approaches strengthen features from different perspectives, significantly boosting network performance in various visual tasks.
Building on this foundation, to address the characteristics of variable target scales, frequent occlusions, and complex lighting conditions in the tea garden scenarios studied in this paper, a multi-level channel attention enhancement mechanism is proposed, which introduces triple SE modules into the Faster R-CNN framework.
The SE module is embedded into the Bottleneck structure of ResNet50 to realize dynamic calibration of bottom-level feature channels. The specific implementation is shown in (2):
$X_{\mathrm{out}} = \sigma\left(f_{\mathrm{ex}}\left(f_{\mathrm{sq}}(X_{\mathrm{in}})\right)\right) \odot X_{\mathrm{in}}$ (2)
where $X_{\mathrm{out}}$ is the output feature of the Bottleneck; $f_{\mathrm{sq}}(\cdot)$ denotes the global average pooling operation; $f_{\mathrm{ex}}(\cdot)$ represents the excitation function containing two fully connected layers; $\sigma$ stands for the Sigmoid activation; and $\odot$ indicates the channel-wise weighting operation. With a 16-fold channel compression ratio, this module can adaptively enhance the response of key features in tea garden scenarios.
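Equation (2) corresponds to the standard SE block; a minimal PyTorch sketch with the 16-fold reduction mentioned above is:

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention corresponding to Equation (2)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # f_sq: global average pooling
        self.excite = nn.Sequential(                    # f_ex: two fully connected layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                               # sigma
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                    # channel-wise re-weighting
```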
The ResNet residual module adopts a dual-path feature processing architecture, whose core consists of two residual blocks forming the main feature processing branch. Input features first undergo preliminary feature extraction through the first residual block, after which the feature flow is divided into a main path and a bypass: the main path sequentially passes through the Inception module, the fully connected (FC) layer of the bottleneck structure, the ReLU activation function, and the second FC layer, and finally generates channel attention weights through Sigmoid activation. Meanwhile, the bypass retains an approximation of the original features. The two paths are fused in the Scale module, where the SE mechanism calibrates the attention weights, and the enhanced feature $\tilde{X}$ is output. The structure of the ResNet residual module combined with the SE module is shown in detail in Figure 7.
The SE module is introduced after the feature pyramid convolution layer of the FPN to establish a cross-scale channel attention mechanism. Through the channel attention mechanism of the SE module, a cross-scale feature interaction channel is constructed, enabling the network to adaptively calibrate the channel response weights of features at different levels.

3. Results

3.1. Ablation Experiment

To verify the feasibility of the proposed algorithm, this study conducted 9 groups of experiments, including ablation experiments on loss functions and attention mechanisms. The evaluation was performed using the following metric: mean average precision at an intersection-over-union (IoU) threshold of 0.5 (mAP@0.5).
The core objectives of the ablation experiments are as follows: (1) to quantitatively analyze the differences in adaptability of different loss functions (CIoU, GIoU, SIoU) in complex agricultural scenarios; (2) to verify the effectiveness of three attention mechanisms (SE, EMA, SA) in extracting features of multi-scale picking behaviors; (3) to explore the optimal combination strategy of loss functions and attention mechanisms. In the experiments, a full-combination design was adopted, involving systematic testing of all nine combinations of the three loss functions (CIoU, GIoU, SIoU) and three attention mechanisms (SE, EMA, SA).
The experiments were conducted using the control variable method, with consistent settings for the dataset, training strategy, hardware environment, basic framework, and data augmentation scheme. This ensures the comparability of experimental results and focuses on examining the model’s adaptability in complex tea garden scenarios. The results of the ablation experiments are presented in Table 7 and visualized in Figure 8.
In the research on picking behavior recognition for tea garden scenarios, the evaluation of model training results requires a comprehensive consideration of precision, recall, and mAP@0.5. Precision reflects the model's ability to avoid false detections, recall measures its capability to reduce missed detections, and mAP@0.5 indicates the overall accuracy of localization and classification. In this study, Precision, Recall, and mAP@0.5 were adopted as standard detection metrics. Accuracy and specificity are not directly applicable to object detection tasks, while the F1-score can be derived from Precision and Recall, as recalled below.
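For reference, the standard definitions underlying these metrics are as follows, where TP, FP, and FN denote true positives, false positives, and false negatives at the chosen IoU threshold, and $\mathrm{AP}_c$ is the average precision of class $c$ among the $C$ categories:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{mAP@0.5} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c \Big|_{\mathrm{IoU}=0.5}
```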
According to the ablation experiment results, among the nine proposed combination schemes, the pairing of SIoU and EMA achieves the best performance. This model reaches an mAP@0.5 of 0.76, while maintaining a high recall of 0.94 and a precision of 0.96. Such performance benefits from two key factors: first, the SIoU loss function introduces a center point distance penalty term, which can effectively address the problem of bounding box localization deviation when pickers are densely arranged; second, the EMA attention mechanism, with its multi-scale feature fusion capability, significantly enhances the model’s adaptability in complex scenarios such as human bodies occluded by branches and leaves.
Further in-depth analysis of the performance differences between various combinations leads to the following key findings:
  • The choice of loss function has a significant impact on model performance. Compared with GIoU and CIoU, SIoU can increase detection accuracy by an average of 1.1% to 1.8%.
  • Among the three attention mechanisms, EMA exhibits the best generalization performance. Although SE achieves the highest precision of 0.979, it has a relatively low recall, indicating that SE may excessively suppress effective features of some occluded targets. The SA module performs stably across all performance metrics and can be used when resources are limited.
Beyond numerical comparison, the observed differences among combinations of attention mechanisms and loss functions reveal their distinct optimization behaviors. The EMA attention enhances multi-scale spatial consistency, enabling the model to better suppress redundant vegetation textures. SE attention achieves high precision by emphasizing salient regions, but it tends to neglect weakly illuminated or partially occluded targets, leading to lower recall. SA attention provides a balanced but less discriminative response, which is suitable for computationally constrained applications.
In terms of loss functions, SIoU demonstrates stronger geometric adaptability due to its inclusion of distance and angle penalties, improving localization for irregular postures and small-scale gestures. In contrast, GIoU and CIoU converge faster but are less stable under viewpoint and illumination variations. These findings suggest that the synergy between EMA and SIoU is particularly effective in complex tea garden environments, as it combines robust spatial attention with fine-grained geometric alignment.

3.2. The Training Efficiency and Loss Convergence Characteristics

Training efficiency directly affects training time and reflects training speed, while loss convergence characteristics effectively indicate the model's learning ability. The stability and speed of the loss function's decline help determine whether the model can effectively fit the data. During training, the improved model exhibited significantly better convergence: the loss value was much lower at the start of training, and convergence was faster. Owing to the effective cooperation between the residual structure and the SE channel attention mechanism, the model became more stable during training, with the fluctuation range of the loss value reduced by approximately half. The final converged loss value was 4.5% lower than that of the baseline model. Analysis of the validation set revealed that, under the combined effect of multi-scale feature fusion and the attention mechanism, the improved model showed significantly enhanced robustness in complex scenarios. Although the final validation loss of the improved model was the same as that of the original model, its prediction stability increased by approximately 50%. This demonstrates that the adopted optimization strategies have significantly improved the model's adaptability to situations such as tea farmers being occluded by tea branches and leaves, and illumination changes caused by different weather conditions in tea gardens. The comparison chart of loss function results is shown in Figure 9.
Detection accuracy effectively reflects the overall localization and classification accuracy. The improved model achieved a significant breakthrough in the key detection metric mAP@0.5. By virtue of cross-layer feature fusion and anchor box parameter optimization, with an anchor box size of 12 × 12 the detection recall increased from 68% for the original model to 83%, largely resolving the original model's tendency to miss small-scale targets. Meanwhile, the SE channel attention mechanism further enhanced the model's feature discrimination ability, increasing the detection success rate by 18% for scenes with long shooting distances and improving category classification accuracy by 9%. These improvements raised the model's mid-training mAP@0.5 from below 0.75 for the original model to 0.78. The comparison chart of detection accuracy is shown in Figure 10, and Figure 11 shows detection results before and after optimization on the same four samples.

4. Discussion

This study proposed TeaPickingNet, a new deep learning framework designed to address the unique challenges of behavior recognition in tea garden environments. The framework integrates improved versions of YOLOv5 and Faster R-CNN with attention mechanisms, loss function enhancements, and customized dataset support, achieving robust performance under real-world agricultural conditions.

4.1. Model Improvements for Complex Tea Garden Scenarios

Tea garden environments pose unique challenges for object detection due to dense vegetation, occlusion by tea plants, variable lighting, and scale variation in human–machine interactions. The proposed model addresses these challenges through the following innovations:
  • The YOLOv5 branch, enhanced with EMA attention and ASPP modules, enables more effective multi-scale feature extraction and background suppression, improving localization accuracy under cluttered conditions.
  • The adoption of SIoU loss, incorporating angle and distance penalties, improves bounding box regression in scenes where pickers exhibit diverse postures and orientations, which is common in freeform tea-picking activities.
  • The Faster R-CNN branch, augmented with multi-level SE attention and anchor box optimization, further enhances detection of small or occluded targets, particularly in long-shot or top-view images commonly seen in surveillance and UAV-based data collection.

4.2. Ablation Insights: Performance Gains from Attention and Loss Functions

The ablation study systematically compared combinations of three attention mechanisms (EMA, SE, SA) and three loss functions (CIoU, GIoU, SIoU). Key findings include:
  • The EMA + SIoU combination achieved the highest mAP@0.5 (0.76), precision (0.96), and recall (0.94), showing its strong generalization ability in dynamic outdoor scenes.
  • While SE attention achieved the highest precision (0.97), it slightly reduced recall, indicating potential overemphasis on dominant features and reduced sensitivity to occluded objects.
  • GIoU and CIoU, while computationally simpler, demonstrated less robustness in bounding box localization for tea-picking scenarios with irregular movements and partial visibility.
These results validate the importance of customized loss functions and attention mechanisms in agricultural behavior recognition tasks, where standard solutions may underperform.
Furthermore, during visual inspections across diverse lighting and environmental conditions, the model exhibited consistent detection behavior. Under backlit or high-contrast settings, EMA attention adaptively recalibrated spatial responses, reducing false positives caused by bright reflections on tea leaves. Similarly, SIoU improved bounding box stability for pickers partially obscured by foliage. These qualitative observations support the quantitative ablation findings and confirm that the proposed design effectively enhances robustness against background interference and lighting variability, even without additional model retraining.

4.3. Dataset Generalization and Real-World Deployability

The dataset constructed in this work significantly contributes to the robustness and field adaptability of the proposed models. It features:
  • High diversity: covering 138 unique scenarios with variations in terrain, lighting, weather, camera angle, and distance.
  • Comprehensive annotations: enabling multi-scale training and robust model learning.
  • Balance between field and web-collected data: ensuring both real-world variability and controlled quality.
Although the current model processes single-frame images, the raw dataset was collected from continuous video streams, inherently containing temporal dynamics of picking behaviors. The frame-slicing strategy used in this study captures key action states across time, which ensures partial representation of motion continuity. However, explicit temporal modeling—such as tracking motion transitions or inter-frame dependencies—is not yet incorporated. Future work will integrate temporal learning frameworks (e.g., Temporal Shift Module (TSM), ConvLSTM, or SlowFast networks) to exploit motion cues and further enhance recognition of continuous picking behaviors.

4.4. Remaining Challenges and Future Work

Despite promising results, several limitations restrict practical application and further development:
  • Lack of temporal modeling: Current frame-based recognition ignores temporal dependencies across frames, which are vital for understanding dynamic behaviors (e.g., transitions from standing to picking, sequential hand movements in plucking). This causes inaccuracies in recognizing multi-frame actions, such as misclassifying a mid-picking pause as a static posture.
    • Future work will adopt video-based architectures (TSM, SlowFast) to capture motion patterns. TSM will model short-term motion via temporal feature shifting; SlowFast will combine high-resolution spatial details (slow pathway) and high-frame-rate temporal dynamics (fast pathway) for short/long-term dependencies. Temporal attention will weight key frames, aiming to reduce dynamic behavior recognition errors by at least 15%.
  • Model efficiency and deployment: The architecture increases computational load and memory usage, limiting deployment on resource-constrained edge devices (drones, embedded systems) in agriculture. For example, drone-based real-time monitoring may suffer delays due to high computational demands.
    • Lightweight strategies (pruning, quantization, edge framework integration) will be explored. Pruning removes redundant neurons/layers; quantization converts to lower bit-width to reduce complexity.
  • Scene-specific data limitations: The dataset, though diverse in tea varieties/methods, is mainly from Chinese tea gardens. Unique local factors (soil, climate, plant growth) may cause overfitting, degrading performance in other regions or agricultural contexts (e.g., fruit picking).
Although the current dataset incorporates samples from multiple geographic regions across China and includes diverse environmental conditions (such as mountainous, hilly, and plain tea gardens under various weather scenarios), it still represents a single national domain. Therefore, the generalization ability of TeaPickingNet to cross-regional or cross-crop scenarios has not been fully validated. Differences in plantation layout, light reflection, or foliage density in other regions may lead to domain shift and affect detection performance. In future work, we plan to conduct cross-domain validation using datasets collected from tea plantations in different provinces to evaluate model transferability. Domain adaptation and fine-tuning strategies will also be investigated to mitigate regional bias and enhance robustness across heterogeneous environments.

5. Conclusions

In this study, we propose TeaPickingNet, a novel deep learning framework tailored for tea garden picking behavior recognition. By constructing a large-scale, annotated dataset encompassing diverse conditions and introducing significant improvements to the YOLOv5 and Faster R-CNN architectures, our method achieves robust performance in complex agricultural scenarios. Experimental results demonstrate that the integration of SIoU loss and EMA attention enhances recognition accuracy and stability under occlusion, variable lighting, and multi-scale target conditions. Compared with the baseline, the proposed method improves mAP@0.5 by over 11.6% in relative terms, together with clear gains in recall and precision.
Overall, the integrated attention–loss optimization scheme not only improves accuracy but also strengthens the model’s robustness to environmental diversity, making TeaPickingNet suitable for deployment in dynamic outdoor tea garden environments.
Furthermore, a real-time detection interface has been developed to facilitate practical applications in field settings. While the framework shows promise for intelligent monitoring in smart agriculture, further optimization is required for lightweight deployment and temporal behavior modeling. This research lays a solid foundation for intelligent tea garden management and opens up new possibilities for applying deep learning in real-world agricultural behavior recognition tasks.

Supplementary Materials

The dataset has been published as a standalone data article, which can be accessed and downloaded here: https://www.frontiersin.org/journals/plant-science/articles/10.3389/fpls.2024.1473558/full [49], accessed on 20 October 2025.

Author Contributions

Conceptualization, R.H., Y.Z. and L.S.; data curation, Y.Z. and R.H.; formal analysis, R.H. and L.S.; funding acquisition, L.S. and G.C.; investigation, Y.Z. and G.C.; methodology, R.H. and L.S.; supervision, L.S. and G.C.; writing, R.H., Y.Z., L.S. and G.C.; writing—review and editing, L.S. and G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research Start-Up Fund for Talent Introduction of Guangdong University of Petrochemical Technology: Agricultural Behavior Data Collection and Dataset Production, under Grant: 2023rcyj2012, in 2023.

Data Availability Statement

The dataset used in this study is publicly available on IEEE DataPort to facilitate research reproducibility and further development in smart agriculture and behavior recognition. It includes 12,195 annotated images across seven tea garden activity categories, along with metadata and annotation files. The dataset can be accessed at: https://doi.org/10.21227/dnkh-8e73 (DOI: 10.21227/dnkh-8e73). Researchers are welcome to download and use the data for non-commercial academic purposes under the specified license. In addition, the complete experimental code of this study has been released on GitHub to further support reproducibility and transparency. The source code is available at: https://github.com/Monica-hr/TeaPickingNet, accessed on 20 October 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mao, Y.; Li, H.; Wang, Y.; Xu, Y.; Fan, K.; Shen, J.; Han, X.; Ma, Q.; Shi, H.; Bi, C.; et al. A Novel Strategy for Constructing Ecological Index of Tea Plantations Integrating Remote Sensing and Environmental Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 12772–12786.
  2. Li, H.; Wang, Y.; Fan, K.; Mao, Y.; Shen, Y.; Ding, Z. Evaluation of important phenotypic parameters of tea plantations using multi-source remote sensing data. Front. Plant Sci. 2022, 13, 898962.
  3. Luo, B.; Sun, H.; Zhang, L.; Chen, F.; Wu, K. Advances in the tea plants phenotyping using hyperspectral imaging technology. Front. Plant Sci. 2024, 15, 1442225.
  4. Li, H.; Mao, Y.; Wang, Y.; Fan, K.; Shi, H.; Sun, L.; Shen, J.; Shen, Y.; Xu, Y.; Ding, Z. Environmental Simulation Model for Rapid Prediction of Tea Seedling Growth. Agronomy 2022, 12, 3165.
  5. Ndlovu, H.S.; Odindi, J.; Sibanda, M.; Mutanga, O. A systematic review on the application of UAV-based thermal remote sensing for assessing and monitoring crop water status in crop farming systems. Int. J. Remote Sens. 2024, 45, 4923–4960.
  6. Wang, S.M.; Yu, P.C.; Ma, J.; Ouyang, J.X.; Zhao, Z.M.; Xuan, Y.; Fan, D.M.; Yu, J.F.; Wang, X.C.; Zheng, X.Q. Tea yield estimation using UAV images and deep learning. Ind. Crops Prod. 2024, 212, 118358.
  7. Luo, D.; Gao, Y.; Wang, Y.; Shi, Y.C.; Chen, S.; Ding, Z.; Fan, K. Using UAV image data to monitor the effects of different nitrogen application rates on tea quality. J. Sci. Food Agric. 2021, 102, 1540–1549.
  8. Sun, L.; Shen, J.; Mao, Y.; Li, X.; Fan, K.; Qian, W.; Wang, Y.; Bi, C.; Wang, H.; Xu, Y.; et al. Discrimination of tea varieties and bud sprouting phenology using UAV-based RGB and multispectral images. Int. J. Remote Sens. 2025, 46, 6214–6234.
  9. Zhao, H.; Zhuge, Y.; Wang, Y.; Wang, L.; Lu, H.; Zeng, Y. Learning Universal Features for Generalizable Image Forgery Localization. arXiv 2025, arXiv:2504.07462.
  10. Møller, H.; Berkes, F.; Lyver, P.O.; Kislalioglu, M.S. Combining Science and Traditional Ecological Knowledge: Monitoring Populations for Co-Management. Ecol. Soc. 2004, 9, 2.
  11. Haralick, R.M.; Shanmugam, K.S.; Dinstein, I. Textural Features for Image Classification. IEEE Trans. Syst. Man Cybern. 1973, 3, 610–621.
  12. Yang, X.; Pan, L.; Wang, D.; Zeng, Y.; Zhu, W.; Jiao, D.; Sun, Z.; Sun, C.; Zhou, C. FARnet: Farming Action Recognition From Videos Based on Coordinate Attention and YOLOv7-tiny Network in Aquaculture. J. ASABE 2023, 66, 909–920.
  13. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149.
  14. Li, H.; Mao, Y.; Shi, H.; Fan, K.; Sun, L.; Zaman, S.; Shen, J.; Li, X.; Bi, C.; Shen, Y.; et al. Establishment of deep learning model for the growth of tea cutting seedlings based on hyperspectral imaging technique. Sci. Hortic. 2024, 331, 113106.
  15. Mazumdar, A.; Singh, J.; Tomar, Y.S.; Bora, P.K. Universal Image Manipulation Detection using Deep Siamese Convolutional Neural Network. arXiv 2018, arXiv:1808.06323.
  16. Esgario, J.G.M.; Castro, P.B.C.; Tassis, L.M.; Krohling, R.A. An app to assist farmers in the identification of diseases and pests of coffee leaves using deep learning. Inf. Process. Agric. 2022, 9, 38–47.
  17. Guo, Z.; Wang, L.; Yang, W.; Yang, G.; Li, K. LDFnet: Lightweight Dynamic Fusion Network for Face Forgery Detection by Integrating Local Artifacts and Global Texture Information. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1255–1265.
  18. Zou, M.; Zhou, Y.; Jiang, X.; Gao, J.; Yu, X.; Ma, X. Spatio-Temporal Behavior Detection in Field Manual Labor Based on Improved SlowFast Architecture. Appl. Sci. 2024, 14, 2976.
  19. Mazumdar, A.; Bora, P.K. Siamese convolutional neural network-based approach towards universal image forensics. IET Image Process. 2020, 14, 3105–3116.
  20. Dai, C.; Liu, X.; Lai, J. Human action recognition using two-stream attention based LSTM networks. Appl. Soft Comput. 2020, 86, 105820. [Google Scholar] [CrossRef]
  21. Chen, S.; Wang, Y.; Lei, Z. Two Stream Convolution Fusion Network based on Attention Mechanism. J. Phys. Conf. Ser. 2021, 1920, 012070. [Google Scholar] [CrossRef]
  22. Ullah, H.; Munir, A. Human Activity Recognition Using Cascaded Dual Attention CNN and Bi-Directional GRU Framework. J. Imaging 2022, 9, 130. [Google Scholar] [CrossRef]
  23. Zhang, M.; Hu, H.; Li, Z.; Chen, J. Action detection with two-stream enhanced detector. Vis. Comput. 2022, 39, 1193–1204. [Google Scholar] [CrossRef]
  24. Liakos, K.G.; Busato, P.; Moshou, D.; Pearson, S.; Bochtis, D.D. Machine Learning in Agriculture: A Review. Sensors 2018, 18, 2674. [Google Scholar] [CrossRef] [PubMed]
  25. Sun, Z.; Liu, J.; Ke, Q.; Rahmani, H. Human Action Recognition From Various Data Modalities: A Review. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 3200–3225. [Google Scholar] [CrossRef]
  26. Han, R.; Shu, L.; Li, K. A Method for Plant Disease Enhance Detection Based on Improved YOLOv8. In Proceedings of the 2024 IEEE 33rd International Symposium on Industrial Electronics (ISIE), Ulsan, Republic of Korea, 18–21 June 2024; pp. 1–6. [Google Scholar]
  27. Can, W.; Zhiwei, L. Weed recognition using SVM model with fusion height and monocular image features. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2016, 32, 165–174. [Google Scholar] [CrossRef]
  28. Johannes, A.; Picon, A.; Alvarez-Gila, A.; Echazarra, J.; Rodriguez-Vaamonde, S.; Navajas, A.D.; Ortiz-Barredo, A. Automatic plant disease diagnosis using mobile capture devices, applied on a wheat use case. Comput. Electron. Agric. 2017, 138, 200–209. [Google Scholar] [CrossRef]
  29. Sharma, P.; Berwal, Y.P.S.; Ghai, W. Performance analysis of deep learning CNN models for disease detection in plants using image segmentation. Inf. Process. Agric. 2020, 7, 566–574. [Google Scholar] [CrossRef]
  30. Sheng, G.; Min, W.; Yao, T.; Song, J.; Yang, Y.; Wang, L.; Jiang, S. Lightweight Food Image Recognition With Global Shuffle Convolution. IEEE Trans. AgriFood Electron. 2024, 2, 392–402. [Google Scholar] [CrossRef]
  31. Gong, H.; Zhuang, W. An Improved Method for Extracting Inter-Row Navigation Lines in Nighttime Maize Crops Using YOLOv7-Tiny. IEEE Access 2024, 12, 27444–27455. [Google Scholar] [CrossRef]
  32. Janarthan, S.; Thuseethan, S.; Joseph, C.; Vigneshwaran, P.; Rajasegarar, S.; Yearwood, J. Efficient Attention-Lightweight Deep Learning Architecture Integration for Plant Pest Recognition. IEEE Trans. AgriFood Electron. 2025, 3, 548–560. [Google Scholar] [CrossRef]
  33. Abdulazeem, Y.; Balaha, H.M.; Bahgat, W.M.; Badawy, M. Human Action Recognition Based on Transfer Learning Approach. IEEE Access 2021, 9, 82058–82069. [Google Scholar] [CrossRef]
  34. Lei, X.; Wang, T.; Yang, B.; Duan, Y.; Zhou, L.; Zou, Z.; Ma, Y.; Zhu, X.; Fang, W. Progress and perspective on intercropping patterns in tea plantations. Beverage Plant Res. 2022, 2, 18. [Google Scholar] [CrossRef]
  35. Ma, J.; Du, K.; Zheng, F.; Zhang, L.; Gong, Z.; Sun, Z. A recognition method for cucumber diseases using leaf symptom images based on deep convolutional neural network. Comput. Electron. Agric. 2018, 154, 18–24. [Google Scholar] [CrossRef]
  36. Xu, Y.; Mao, Y.; Li, H.; Li, X.; Sun, L.; Fan, K.; Li, Z.; Gong, S.; Ding, Z.; Wang, Y. Construction and application of a drought classification model for tea plantations based on multi-source remote sensing. Smart Agric. Technol. 2025, 12, 101132. [Google Scholar] [CrossRef]
  37. Esgario, J.G.; Krohling, R.A.; Ventura, J.A. Deep learning for classification and severity estimation of coffee leaf biotic stress. Comput. Electron. Agric. 2019, 169, 105162. [Google Scholar] [CrossRef]
  38. Wang, Z.; Yang, Y.; Liu, Z.; Zheng, Y. Deep Neural Networks in Video Human Action Recognition: A Review. arXiv 2023, arXiv:2305.15692. [Google Scholar] [CrossRef]
  39. Liu, S.; Jiang, M.; Kong, J. Multidimensional Prototype Refactor Enhanced Network for Few-Shot Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6955–6966. [Google Scholar] [CrossRef]
  40. Iwana, B.K.; Uchida, S. An empirical survey of data augmentation for time series classification with neural networks. PLoS ONE 2020, 16, e0254841. [Google Scholar] [CrossRef]
  41. Wen, Q.; Sun, L.; Song, X.; Gao, J.; Wang, X.; Xu, H. Time Series Data Augmentation for Deep Learning: A Survey. In Proceedings of the International Joint Conference on Artificial Intelligence, Virtual, 11–17 July 2020. [Google Scholar]
  42. Sawicki, A.; Saeed, K. Application of LSTM Networks for Human Gait-Based Identification. In Theory and Engineering of Dependable Computer Systems and Networks; Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J., Eds.; Springer: Cham, Switzerland, 2021. [Google Scholar]
  43. Abujrida, H.; Agu, E.O.; Pahlavan, K. DeepaMed: Deep learning-based medication adherence of Parkinson’s disease using smartphone gait analysis. Smart Health 2023, 30, 100430. [Google Scholar] [CrossRef]
  44. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5260–5269. [Google Scholar]
  45. Tang, Z.; Li, M.; Wang, X. Mapping Tea Plantations from VHR Images Using OBIA and Convolutional Neural Networks. Remote Sens. 2020, 12, 2935. [Google Scholar] [CrossRef]
  46. Tu, J.; Liu, H.; Meng, F.; Liu, M.; Ding, R. Spatial-Temporal Data Augmentation Based on LSTM Autoencoder Network for Skeleton-Based Human Action Recognition. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 3478–3482. [Google Scholar]
  47. Sawicki, A.; Zieliński, S.K. Augmentation of Segmented Motion Capture Data for Improving Generalization of Deep Neural Networks. In Proceedings of the International Conference on Computer Information Systems and Industrial Management Applications, Bialystok, Poland, 16–18 October 2020. [Google Scholar]
  48. Thakur, D.; Kumar, V. FruitVision: Dual-Attention Embedded AI System for Precise Apple Counting Using Edge Computing. IEEE Trans. AgriFood Electron. 2024, 2, 445–459. [Google Scholar] [CrossRef]
  49. Han, R.; Zheng, Y.; Tian, R.; Shu, L.; Jing, X.; Yang, F. An image dataset for analyzing tea picking behavior in tea plantations. Front. Plant Sci. 2025, 15, 1473558. [Google Scholar] [CrossRef]
Figure 1. Tea production and year-on-year growth in China from 2013 to 2023.
Figure 2. Technical flowchart of tea garden picking behavior recognition.
Figure 5. The structure of the ASPP module.
Figure 7. The ResNet residual module combined with the SE module.
Figure 8. Visualization of the ablation experiment results.
Figure 9. The comparison of loss function results (Left: before optimization; Right: after optimization).
Figure 10. The comparison of detection accuracy results (Left: before optimization; Right: after optimization).
Figure 11. Visualization of the test results.
Table 1. The significance of picking behavior recognition in tea garden scenarios.
Number | The Significance of Picking Behavior Identification
1 | Intervene in unauthorized picking in non-tourist areas promptly
2 | Identify under-picked zones via data analysis for timely inspection
3 | Monitor visitor picking activities to maintain order in tourist plantations
4 | Analyze historical data for scientific guidance to boost tea garden profits
5 | Prohibit picking in rare tree plantations and ensure standardized picking in production gardens
Table 4. The composition of the tea garden picking identification dataset.
Division | Classification | Number of Slices
Number of Pickers | Single Person | 5654
Number of Pickers | Multiple People | 6541
Picking Method | Manual | 11,896
Picking Method | Machine | 299
Picking Weather | Sunny | 3001
Picking Weather | Cloudy | 4271
Picking Weather | Overcast | 3852
Picking Weather | Foggy | 482
Picking Weather | Rainy | 214
Shooting Distance | Close-up | 10,400
Shooting Distance | Long-shot | 754
Shooting Distance | Close-up + Long-shot | 1037
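As a quick consistency check on Table 4, the picker-count and picking-method groupings each sum to the full 12,195-image dataset (illustrative arithmetic only; the weather and shooting-distance groupings cover slightly fewer images).

```python
# Consistency check: the Table 4 subtotals for picker count and picking method
# each add up to the 12,195 images reported for the full dataset.
pickers = {"single person": 5654, "multiple people": 6541}
methods = {"manual": 11896, "machine": 299}
assert sum(pickers.values()) == 12195
assert sum(methods.values()) == 12195
```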
Table 5. The training parameter configuration.
Parameter Name | Parameter Setting | Parameter Name | Parameter Setting
Cache | True | Close mosaic | 20
Epoch | 300 | Scale | 0.75
Single cls | False | Mosaic | 1.0
Batch | 32 | Mixup | 0.2
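The settings in Table 5 map closely onto the arguments of the Ultralytics training API; the sketch below shows one plausible way to express them in code, assuming the ultralytics package and a dataset description file named tea_picking.yaml (both are assumptions; the original training pipeline may differ).

```python
from ultralytics import YOLO

# One plausible mapping of the Table 5 settings to an Ultralytics-style training call
# (illustrative only; the dataset YAML name and base checkpoint are assumptions).
model = YOLO("yolov5su.pt")  # Ultralytics re-export of YOLOv5s; the modified network would be substituted here
model.train(
    data="tea_picking.yaml",  # hypothetical dataset description file
    epochs=300,
    batch=32,
    cache=True,
    single_cls=False,
    close_mosaic=20,   # disable mosaic augmentation for the final 20 epochs
    scale=0.75,
    mosaic=1.0,
    mixup=0.2,
)
```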
Table 6. The reset anchor box sizes and aspect ratios.
Anchor Size | Anchor Aspect Ratio
12 × 12 | 1:1
12 × 12 | 2:3
12 × 12 | 3:2
24 × 24 | 1:1
24 × 24 | 4:3
24 × 24 | 3:4
96 × 96 | 1:1
96 × 96 | 4:3
96 × 96 | 16:9
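For reference, anchor widths and heights are conventionally derived from a base size and an aspect ratio while keeping the anchor area fixed. The short sketch below expands the Table 6 settings into (width, height) pairs in this standard way; interpreting each ratio as width:height, and how the anchors are grouped per feature-map level, are assumptions not specified by the table.

```python
import math

# Expand the Table 6 base sizes and aspect ratios into (width, height) anchor pairs,
# keeping the anchor area equal to size * size (the usual Faster R-CNN convention).
anchor_cfg = {
    12: ["1:1", "2:3", "3:2"],
    24: ["1:1", "4:3", "3:4"],
    96: ["1:1", "4:3", "16:9"],
}

def anchor_wh(size: int, ratio: str):
    # Ratios are read as width:height here (an assumption about the table's convention).
    w_part, h_part = (float(v) for v in ratio.split(":"))
    r = w_part / h_part
    w = size * math.sqrt(r)
    h = size / math.sqrt(r)
    return round(w, 1), round(h, 1)

for size, ratios in anchor_cfg.items():
    for ratio in ratios:
        print(size, ratio, anchor_wh(size, ratio))
```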
Table 7. The results of the ablation experiment.
Model | Precision | Recall | mAP@0.5
CIoU + EMA | 0.92 | 0.91 | 0.73
GIoU + EMA | 0.93 | 0.95 | 0.74
SIoU + EMA | 0.96 | 0.94 | 0.76
CIoU + SA | 0.93 | 0.94 | 0.71
GIoU + SA | 0.93 | 0.92 | 0.76
SIoU + SA | 0.96 | 0.93 | 0.74
CIoU + SE | 0.96 | 0.92 | 0.79
GIoU + SE | 0.94 | 0.94 | 0.76
SIoU + SE | 0.97 | 0.90 | 0.77
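For completeness, the regression losses compared in Table 7 all extend plain IoU with additional penalty terms. The sketch below re-implements one of them, CIoU, for axis-aligned boxes in (x1, y1, x2, y2) format; it is an independent illustration rather than the exact code used in the experiments, and GIoU and SIoU differ only in their penalty terms.

```python
import math
import torch

def ciou_loss(box1: torch.Tensor, box2: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss for boxes given as (x1, y1, x2, y2); returns 1 - CIoU per box pair."""
    # Intersection and union
    inter_w = (torch.min(box1[..., 2], box2[..., 2]) - torch.max(box1[..., 0], box2[..., 0])).clamp(0)
    inter_h = (torch.min(box1[..., 3], box2[..., 3]) - torch.max(box1[..., 1], box2[..., 1])).clamp(0)
    inter = inter_w * inter_h
    w1, h1 = box1[..., 2] - box1[..., 0], box1[..., 3] - box1[..., 1]
    w2, h2 = box2[..., 2] - box2[..., 0], box2[..., 3] - box2[..., 1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # Squared diagonal of the smallest enclosing box and squared centre distance
    cw = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    ch = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((box1[..., 0] + box1[..., 2] - box2[..., 0] - box2[..., 2]) ** 2 +
            (box1[..., 1] + box1[..., 3] - box2[..., 1] - box2[..., 3]) ** 2) / 4

    # Aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)

# Example: a predicted box slightly shifted relative to its ground truth
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 14.0, 48.0, 66.0]])
print(ciou_loss(pred, gt))
```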
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
