1. Introduction
As a traditional and important economic crop, tea relies heavily on appearance quality for market grading and value assessment. As consumer expectations continue to rise and the tea industry undergoes intelligent transformation, rapid and precise assessment of tea leaf appearance has become a crucial factor driving digital development. Manual, experience-driven inspection is limited by subjectivity, low efficiency, and high labor costs, and it adapts poorly to multi-variety, multi-batch, and complex operating conditions. These constraints place higher demands on automated appearance inspection based on computer vision.
In recent years, artificial intelligence in agricultural settings has advanced steadily across tasks such as crop recognition, pest monitoring, and product grading, and the tea domain has seen effective explorations. For instance, Zhang et al. [1] achieved efficient identification of one-bud-two-leaf tea samples on edge devices through structural simplification and channel pruning. Cao et al. [2] incorporated the GhostNet module together with coordinate attention to achieve lightweight and efficient detection of tea buds. Shuai et al. [3] enhanced the identification of dense small targets by integrating image modality information and modifying the YOLOv5 framework. Wang et al. [4] incorporated attention mechanisms along with improved feature fusion methods, which enhanced the accuracy of small tea bud detection under complex background conditions. Meng et al. [5] refined the YOLOX-tiny and PSPNet architectures to achieve reliable tea bud identification and the precise determination of picking positions. Yang et al. [6] constructed a classification model for tea bud recognition based on YOLOX. Zhao et al. [7] proposed a multi-variety tea bud detection approach by integrating an improved YOLOv7 with an ECA attention mechanism. Meng et al. [8] further enhanced feature extraction by embedding DSConv, CBAM, and CA modules into an improved YOLOv7 network. Shi et al. [9] proposed a small-object detection approach tailored to complex background scenarios by integrating the Swin Transformer with YOLOv8, while Xie et al. [10] enhanced detection performance under challenging conditions by combining deformable convolutions, attention modules, and an improved spatial pyramid pooling structure.
In parallel, lightweight object detection oriented to edge and embedded use cases has progressed rapidly and shows promising transferability and deployability in heterogeneous conditions. Moosmann et al. [11] presented TinyissimoYOLO, which leverages quantization and low-memory optimization for efficient detection on low-power microcontrollers. Li et al. [12] proposed Edge-YOLO, which integrates pruning and feature fusion for lightweight infrared detection on edge devices. Betti et al. [13] proposed YOLO-S, a lightweight framework with a compact architecture and enhanced feature extraction, designed for small-object detection in aerial images. Reis et al. [14] used YOLOv8 with transfer learning to build a lightweight real-time detector, trained on multi-class flying-object data and fine-tuned in complex environments, achieving high Precision for small targets and occlusions. Alqahtani et al. [15] benchmarked different detectors on edge hardware to evaluate the performance of lightweight methods. Nghiem et al. [16] proposed LEAF-YOLO, which integrates lightweight feature extraction with multi-scale fusion to enable real-time detection of small objects in aerial imagery on edge devices.
Despite these advances, tea leaves, which exhibit subtle inter-class differences, complex textures, and significant batch variability, still pose unresolved challenges. Accuracy and speed remain difficult to balance, and parameter counts are often large, which hinders deployment on mobile and embedded platforms. In production scenarios with multiple coexisting varieties, large scale variation, and dense stacking, feature representation is easily disturbed, bounding-box regression becomes unstable, and small objects are frequently missed. It is therefore necessary to conduct targeted research on tea leaf appearance recognition and to adopt a dedicated lightweight design so that detection accuracy is maintained together with stable real-time inference. This direction improves the automation and consistency of quality evaluation and supports smart agriculture in resource-constrained settings.
To tackle these issues, we introduce a lightweight detection framework named TeaAppearanceLiteNet for evaluating tea leaf appearance, in which multiple innovative modules are embedded into the overall architecture. The proposed method effectively lowers computational overhead while sustaining, and in some cases enhancing, detection accuracy, thus demonstrating robust real-time capability. The main contributions of this study are outlined as follows:
- 1. The C3k2_PartialConv module incorporates PartialConv operations to effectively minimize redundant calculations and reduce memory access overhead.
- 2. To address the limitations of CBAM in channel attention, this work presents the CBMA_MSCA mechanism, which incorporates a multi-scale strategy.
- 3. The Detect_PinwheelShapedConv head is introduced, leveraging pinwheel-shaped convolutions to enhance feature perception and spatial representation capabilities.
- 4. The MPDIoU_ShapeIoU loss function is developed by combining MPDIoU and ShapeIoU, aiming to improve detection accuracy and regression stability.
Overall, TeaAppearanceLiteNet achieves a better balance between compactness and accuracy and is clearly distinct from lightweight variants of the YOLO family. C3k2_PartialConv suppresses redundancy and performs selective channel updates, improving feature utilization and computational efficiency while preserving the backbone topology and avoiding the representational loss associated with depthwise separability and layerwise pruning. CBMA_MSCA injects multi-scale context and introduces saliency competition to realize fine-grained discrimination across scales, which matches the subtle differences and wide scale range of tea leaves. Detect_PinwheelShapedConv, together with MPDIoU_ShapeIoU, forms a closed loop of perception enhancement and shape consistency, in which orientation-sensitive convolution strengthens spatial representation and multidimensional regression stabilizes localization, yielding better shape sensitivity and robustness. Benefiting from these designs, TeaAppearanceLiteNet is suitable for agricultural vision tasks in resource-limited deployments, provides a lightweight, efficient, and practical solution for smart agriculture, and offers a transferable design paradigm for fine-grained analysis in tea leaf appearance recognition.
2. Materials and Methods
2.1. Architecture of the TeaAppearanceLiteNet Network
In this work, we present TeaAppearanceLiteNet, whose overall framework is illustrated in Figure 1. The architecture incorporates the C3k2_PartialConv module, leveraging the advantages of PartialConv to improve computational efficiency and feature representation while maintaining structural effectiveness. The CBMA_MSCA attention mechanism is employed to enable the multi-scale modeling of channel attention, allowing for a more refined extraction of salient features across targets of varying sizes. The Detect_PinwheelShapedConv head introduces a pinwheel-shaped convolution in place of part of the conventional convolution operations, enhancing both feature perception and spatial representation. To strengthen detection robustness, this study introduces the MPDIoU_ShapeIoU regression loss, which simultaneously accounts for the spatial position, geometric shape, and scale consistency between predicted and ground-truth boxes, thus improving both accuracy and regression stability.
2.2. C3k2_PartialConv
This work presents the C3k2_PartialConv module, where the Bottleneck in the original C3k structure of the C3k2 module is substituted with PartialConv [17] to enable more efficient and accurate feature extraction. The overall design is shown in Figure 2.
The core idea of PartialConv is that convolution operations are performed on only a portion of the input feature map channels, while the other channels are directly forwarded without modification. In practice, either the initial or final continuous segment of channels is typically selected as the convolution subset to facilitate contiguous memory access and improve execution efficiency. The channel ratio for the subset (e.g., 1/4 or 1/2) is predefined and has been validated across multiple tasks to retain most of the essential information effectively.
To strengthen feature fusion, a 1 × 1 convolution is performed across all channels after the PartialConv operation. This step compensates for the untouched channels and improves the overall completeness and accuracy of feature representation.
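To make the partial-convolution idea concrete, the following PyTorch sketch applies a 3 × 3 convolution to only a fraction of the channels, forwards the remaining channels unchanged, and fuses all channels with a 1 × 1 convolution. The split ratio, kernel size, and the normalization and activation in the fusion step are illustrative assumptions rather than the exact TeaAppearanceLiteNet implementation.

```python
import torch
import torch.nn as nn


class PartialConv(nn.Module):
    """Minimal sketch of PartialConv: a 3x3 convolution is applied to only the
    first 1/div of the channels; the remaining channels are forwarded untouched.
    A 1x1 convolution then mixes all channels."""

    def __init__(self, channels: int, div: int = 4):
        super().__init__()
        self.conv_ch = channels // div                # channels that are convolved
        self.untouched_ch = channels - self.conv_ch   # channels passed through as-is
        self.partial_conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, 1, 1, bias=False)
        # Pointwise fusion so the untouched channels also contribute to the output.
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_ch, self.untouched_ch], dim=1)
        x1 = self.partial_conv(x1)  # convolve only the selected channel slice
        return self.fuse(torch.cat([x1, x2], dim=1))


if __name__ == "__main__":
    y = PartialConv(64)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```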
By preserving the multi-scale processing strengths of the C3k module while optimizing convolutional efficiency, this modification maintains accuracy at a lower computational cost. The benefit is particularly evident in boundary detection and scenes with complex backgrounds, where higher feature extraction precision is required. By exploiting the redundancy among feature channels, the C3k2_PartialConv module markedly improves computational efficiency without a substantial increase in parameters or overall computational burden.
In the context of tea leaf appearance inspection, the selective convolution mechanism of PartialConv enhances the ability to capture fine-grained edge information, which is especially valuable for delineating irregular leaf boundaries. Moreover, by emphasizing critical spatial features while reducing redundant computation, the C3k2_PartialConv module improves robustness against variations in leaf shape and complex background interference, thereby supporting more accurate and reliable feature extraction in agricultural vision tasks.
2.3. CBMA_MSCA
This work introduces the CBMA_MSCA attention mechanism, which extends the original CBAM [18] by substituting its channel attention component with the MSCA module [19]. Through this replacement, multi-scale channel modeling is incorporated to improve the accuracy of feature representation. The overall architecture is shown in Figure 3.
For channel modeling, CBMA_MSCA leverages the multi-branch bar-shaped convolutional structure of MSCA, including kernels such as 1 × 7, 7 × 1, 1 × 11, 11 × 1, 1 × 21, and 21 × 1 to extract features at multiple scales in parallel. This enables the effective capture of both local details and long-range dependencies, enhances the ability to model significant inter-channel interactions, and improves the accuracy and discriminative power of channel attention.
For spatial modeling, CBMA_MSCA preserves the spatial attention component of CBAM, where spatial features are obtained via a parallel average and max pooling, followed by convolution to produce the spatial attention map. This enhances the sensitivity of the network to key spatial positions and improves the perception of object structure and spatial layout.
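A minimal PyTorch sketch of this joint attention is given below, assuming an MSCA-style channel branch (a depthwise 5 × 5 base kernel plus the 1 × 7/7 × 1, 1 × 11/11 × 1, and 1 × 21/21 × 1 strip branches) followed by CBAM's spatial branch; the base kernel size, branch ordering, and fusion details are assumptions for illustration rather than the exact module.

```python
import torch
import torch.nn as nn


class MSCAChannel(nn.Module):
    """MSCA-style multi-scale strip-convolution attention used in place of
    CBAM's channel attention (kernel sizes follow the text above)."""

    def __init__(self, c: int):
        super().__init__()
        self.base = nn.Conv2d(c, c, 5, padding=2, groups=c)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c),
                nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c),
            )
            for k in (7, 11, 21)
        )
        self.mix = nn.Conv2d(c, c, 1)

    def forward(self, x):
        attn = self.base(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        return x * self.mix(attn)


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise avg/max maps -> 7x7 conv -> sigmoid."""

    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))


class CBMA_MSCA(nn.Module):
    """Multi-scale channel attention followed by spatial attention."""

    def __init__(self, c: int):
        super().__init__()
        self.channel = MSCAChannel(c)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```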
CBMA_MSCA leverages multi-scale channel modeling together with refined spatial saliency to establish joint channel–spatial attention, which substantially improves its capacity for effective feature selection. Additionally, CBMA_MSCA inherits the lightweight design of CBAM by employing depthwise separable convolutions, resulting in minimal computational and parameter overhead. With strong adaptability and generalization, this design proves effective for a broad spectrum of visual recognition tasks.
Compared with other attention mechanisms, CBMA_MSCA achieves a higher accuracy because the introduction of multi-scale strip convolutions effectively aggregates contextual information across different receptive fields. This design provides a superior capability in modeling objects of varying sizes, which is critical for dense prediction tasks. Therefore, CBMA_MSCA can capture fine-grained details while simultaneously maintaining global consistency, leading to more precise feature representations and an improved overall performance.
2.4. Detect_PinwheelShapedConv
This study introduces the detection head Detect_PinwheelShapedConv, which integrates PinwheelShapedConv [20] to enable more effective feature extraction and receptive field expansion. This design is particularly well-suited for detecting weak and small targets. The architecture is shown in Figure 4.
Unlike standard convolution, Detect_PinwheelShapedConv adopts the distinctive asymmetric padding strategy of a pinwheel-shaped convolution. Through the outward alternation of horizontal and vertical kernels, the receptive field is greatly extended. This innovative design facilitates efficient low-level feature capture, enhances object–background discrimination, and significantly strengthens the modeling of subtle target features.
Detect_PinwheelShapedConv greatly enlarges the receptive field while incurring only a slight growth in parameter count. This parameter efficiency is attributed to its grouped convolution structure, which enables the significant enlargement of the receptive field while maintaining low computational overhead. Owing to its structure, the design is particularly advantageous for small-object detection when dealing with faint targets and background complexity.
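The sketch below illustrates the asymmetric-padding idea with four strip-convolution branches whose one-sided padding extends the receptive field in different directions before a pointwise fusion; the branch widths, kernel length, and fusion layer are illustrative assumptions, and the surrounding detection-head wiring is omitted.

```python
import torch
import torch.nn as nn


class PinwheelShapedConv(nn.Module):
    """Sketch of a pinwheel-shaped convolution: four horizontal/vertical strip
    kernels, each with asymmetric (one-sided) zero padding so their receptive
    fields extend outward in different directions, concatenated and fused."""

    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        c_mid = c_out // 4
        # nn.ZeroPad2d padding order is (left, right, top, bottom).
        pads = [(k - 1, 0, 0, 0), (0, k - 1, 0, 0), (0, 0, k - 1, 0), (0, 0, 0, k - 1)]
        kernels = [(1, k), (1, k), (k, 1), (k, 1)]
        self.branches = nn.ModuleList(
            nn.Sequential(nn.ZeroPad2d(p), nn.Conv2d(c_in, c_mid, ks, bias=False))
            for p, ks in zip(pads, kernels)
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        # Each branch preserves the spatial size, so the outputs can be concatenated.
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))
```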
Furthermore, Detect_PinwheelShapedConv improves tea leaf appearance inspection by effectively handling shape variation, fine-grained edges, and low-contrast defects. Its expanded anisotropic receptive field captures curved contours under deformation, while the alternating horizontal–vertical kernels enhance sensitivity to serrated margins and microcracks. The Gaussian-aligned emphasis increases contrast for weak textures and blemishes, reducing background interference. In addition, the decoupled-head formulation allows the classification and localization branches to specialize, jointly improving Precision and stability for subtle defects in complex production settings.
2.5. MPDIoU_ShapeIoU
This study proposes the MPDIoU_ShapeIoU loss function, which effectively combines the positional information of MPDIoU [21] with the shape and scale descriptors of ShapeIoU [22]. This integration allows for more sensitive capture of both location discrepancies and differences in geometric proportions, thereby improving generalization across targets of varying scales and shapes. As a result, the accuracy and robustness of the bounding box regression are significantly enhanced.
The MPDIoU_ShapeIoU loss function is defined in (1), combining an IoU overlap term with the center-distance, shape, and edge-distance terms described below.

The IoU term computes the ratio between the overlap and the union of the two boxes, as given in (2):

$$\mathrm{IoU} = \frac{\left| B_1 \cap B_2 \right|}{\left| B_1 \cup B_2 \right|} \tag{2}$$

where $\left| B_1 \cap B_2 \right|$ is the intersection area of boxes $B_1$ and $B_2$, and $\left| B_1 \cup B_2 \right|$ is the union area. $\rho$ denotes the center-to-center Euclidean distance of the two boxes, as in (3). The shape term measures the discrepancy of shapes between the two boxes, as in (4), where $w_1$, $h_1$ and $w_2$, $h_2$ are the width and height of $B_1$ and $B_2$, respectively. $d_1$ and $d_2$ denote the vertical edge distances between the two boxes: $d_1$ is the vertical distance along one axis, defined in (5), and $d_2$ is the vertical distance along the other axis, defined in (6). Finally, $\lambda$ is a weighting factor that can be adjusted according to box size or other considerations; it rescales the vertical distances so that their influence is appropriate for boxes of different sizes, as shown in (7).
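As an illustration of how such terms can be combined, the following PyTorch sketch mixes an IoU term, MPDIoU-style corner-distance penalties, and a simple width/height shape discrepancy. The exact shape formulation and weighting used by MPDIoU_ShapeIoU may differ, so this is a simplified stand-in rather than the paper's loss.

```python
import torch


def combined_iou_loss_sketch(pred, target, img_w, img_h, lam=1.0):
    """Illustrative combined box-regression loss (boxes given as [x1, y1, x2, y2]).
    Mixes an IoU term, MPDIoU-style corner-distance penalties, and a simple
    width/height shape-discrepancy term; the weighting `lam` is an assumption."""
    # Intersection and union (cf. Eq. (2)).
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)

    # MPDIoU-style penalties: squared distances between matching corners,
    # normalized by the squared image diagonal so the scale stays comparable.
    diag2 = float(img_w) ** 2 + float(img_h) ** 2
    d1 = ((pred[:, :2] - target[:, :2]) ** 2).sum(dim=1) / diag2  # top-left corners
    d2 = ((pred[:, 2:] - target[:, 2:]) ** 2).sum(dim=1) / diag2  # bottom-right corners

    # Simple shape discrepancy between widths and heights (ShapeIoU-inspired).
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    shape = (torch.abs(w_p - w_t) / torch.max(w_p, w_t).clamp(min=1e-7)
             + torch.abs(h_p - h_t) / torch.max(h_p, h_t).clamp(min=1e-7))

    return (1.0 - iou + d1 + d2 + lam * shape).mean()


if __name__ == "__main__":
    p = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
    t = torch.tensor([[12.0, 14.0, 48.0, 58.0]])
    print(combined_iou_loss_sketch(p, t, img_w=640, img_h=640))
```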
2.6. Dataset
The tea leaf appearance dataset constructed in this study encompasses a wide range of leaf conditions, with the goal of enhancing the model’s ability to identify the diverse visual traits of tea leaves. The data preparation procedure is presented in Figure 5. To ensure clean backgrounds and prominent subjects, tea leaves were uniformly arranged on white paper and photographed under controlled conditions. All images were captured using the main camera of an Apple iPhone 14 Pro Max, which has a resolution of 48 megapixels, with the original image resolution set to 4032 × 3024 pixels. During preprocessing, each image was cropped and resized to 640 × 640 pixels to better preserve fine leaf details. A strict screening procedure was applied to eliminate blurred, distorted, or otherwise unsuitable images, ensuring the dataset’s integrity and reliability.
The finalized dataset contains 3313 images, with 2320 allocated for training, 664 for validation, and 329 reserved for testing. Each tea leaf instance within the images is annotated into one of four categories based on appearance characteristics: fine indicates complete, tender leaves with clear edges and uniform color, representing a high-quality appearance; coarse refers to older, rough leaves with damaged or wrinkled edges, indicating a lower visual quality; touching describes leaves that are close to or overlapping each other, resulting in unclear or compressed boundaries; unsure is used when factors such as blur, occlusion, or abnormal lighting prevent an accurate judgment of the appearance. The total number of annotated instances for each category is 9209 (fine), 10,914 (coarse), 66 (touching), and 1530 (unsure).
To mitigate the impact of class imbalance during evaluation and the limitations arising from the relatively small dataset size, several strategies were applied during model training. The training pipeline employed data augmentation, including random flipping and scaling, to increase sample diversity while preserving the key visual characteristics of tea leaves. The hyperparameters were also carefully tuned: a smaller learning rate and early stopping were employed to mitigate overfitting, while an appropriate batch size combined with multi-scale training contributed to improved robustness and better generalization across categories. Performance was further assessed using class-wise metrics such as Precision and Recall, ensuring fair and representative measurement for all categories.
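For example, class-wise Precision and Recall can be computed from per-class detection counts so that the rare touching category is not masked by the majority classes; the counts in the snippet below are hypothetical and serve only to illustrate the computation.

```python
def classwise_precision_recall(stats):
    """Compute per-class Precision and Recall from detection counts.
    `stats` maps a class name to (true_positives, false_positives, false_negatives)."""
    metrics = {}
    for cls, (tp, fp, fn) in stats.items():
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        metrics[cls] = {"precision": round(precision, 3), "recall": round(recall, 3)}
    return metrics


# Hypothetical counts for the four appearance categories, for illustration only.
example_counts = {
    "fine": (900, 50, 40),
    "coarse": (1000, 80, 60),
    "touching": (5, 2, 3),
    "unsure": (120, 30, 25),
}
print(classwise_precision_recall(example_counts))
```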
2.7. Experimental Environment and Parameter Configuration
Table 1 presents the hardware and software specifications of the computer used for training, together with detailed information about the training environment of the model. The training was conducted with Python version 3.8.19, which was selected on the basis of its compatibility with essential libraries, even though more recent versions of Python were available.
Table 2 summarizes the training parameters adopted in the experimental process.
Figure 6 depicts the training and validation loss curves obtained over 300 epochs. A rapid reduction in all loss components is observed in the early stage, followed by stabilization, which demonstrates steady optimization and robust convergence. The training and validation curves align closely in both value and trend, indicating that no significant overfitting or underfitting occurs and confirming the effectiveness of the training strategy and the strong generalization capability of the model.
4. Discussion
The proposed TeaAppearanceLiteNet strikes an effective balance between lightweight architecture and detection accuracy, demonstrating its feasibility and potential for tea leaf appearance grading. A further point of discussion is its adaptability to more complex tasks and broader application scenarios.
Although four categories are adopted as the experimental basis in this study, grading standards in the tea industry are usually more refined and may even involve cross-varietal distinctions. With strong feature extraction and multi-scale representation capabilities, TeaAppearanceLiteNet shows the potential for application in multi-class grading tasks, but validation with larger-scale and more complex tasks is still needed to further establish its applicability.
The dataset employed in this study was collected under relatively standardized conditions, which ensures stability for both training and evaluation. However, the performance of existing methods in real field environments requires further verification, where variations in illumination, leaf posture, and background diversity may introduce new challenges. Future work may consider collecting more diverse samples under natural conditions to comprehensively evaluate robustness and adaptability in complex scenarios.
The data volume used in this study also leaves room for expansion compared with common deep learning tasks. To enhance generalization and performance, future research can apply data augmentation techniques to enrich diversity and progressively include samples with varying illuminations, backgrounds, and leaf conditions. Such efforts would not only improve generalization but also provide a stronger foundation for practical deployment.
From the perspective of industrial application, the lightweight nature of TeaAppearanceLiteNet enables deployment on embedded platforms and edge computing devices, satisfying the requirements of real-time processing under limited computational resources. It should be noted that applications of computer vision and machine learning in tea leaf grading are still at an exploratory stage, with most work concentrated in academic research. With the rapid progress of intelligent detection technologies in agricultural product grading, this approach can be expected to have considerable potential in tea grading and quality control. If performance stability in complex environments is further improved while maintaining accuracy and efficiency, and adaptation to industrial hardware conditions is achieved, its practical value will become more prominent.
In summary, TeaAppearanceLiteNet not only verifies the effectiveness of lightweight networks for tea leaf appearance inspection but also offers valuable insights for future research and industrial applications. Subsequent work may focus on expanding data diversity, enhancing adaptability in natural environments, and validating industrial applications, thereby advancing the adoption of intelligent detection technologies in the digital transformation of the tea industry.