Article

Multi-Scale Cross-Domain Augmentation of Tea Datasets via Enhanced Cycle Adversarial Networks

1 School of Mechanical Engineering, Zhejiang Sci-Tech University, Hangzhou 310018, China
2 Provincial Key Laboratory of Agricultural Intelligent Sensing and Robotics, Zhejiang Province, Hangzhou 310018, China
3 Zhejiang Ocean University, Zhoushan 316022, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(16), 1739; https://doi.org/10.3390/agriculture15161739
Submission received: 6 July 2025 / Revised: 1 August 2025 / Accepted: 7 August 2025 / Published: 13 August 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

To address the phenotypic variability and detection accuracy issues of tea shoots in open-air gardens caused by lighting and varietal differences, this study proposes Tea CycleGAN, a data augmentation method that combines multi-scale image style transfer with spatially consistent dataset generation. Using Longjing 43 and Zhongcha 108 as cross-domain objects, the generator integrates SKConv and a dynamic multi-branch residual structure for multi-scale feature fusion, optimized by an attention mechanism, while a deepened discriminator with additional convolutional layers and batch normalization enhances detail discrimination. A global–local framework trains on 600 × 600 background images and 64 × 64 tea shoot regions, with a restoration-paste strategy to preserve spatial consistency. Experiments show that Tea CycleGAN achieves FID scores of 42.26 and 26.75, outperforming CycleGAN. Detection with YOLOv7 sees mAP rise from 73.94% to 83.54%, surpassing Mosaic and Mixup. The method effectively mitigates the impacts of lighting and scale, offering a reliable data augmentation solution for tea picking.

1. Introduction

The global tea industry holds significant economic and cultural value, serving as a critical source of income for farmers and driving regional economic development in multiple countries [1,2,3]. The harvesting period for premium Chinese spring green tea typically occurs around March 20 and lasts no more than 20 days. This short window necessitates intensive manual labor, leading to high operational costs compounded by the 8-h daily restriction on daylight harvesting. To address this challenge, researchers have increasingly focused on machine vision technologies to achieve cross-Day–Night precision recognition of tea shoots [4,5,6], aiming to automate harvesting processes and sustainably develop the tea industry.
The availability of high-quality tea shoot detection datasets is fundamental to achieving accurate cross-Day–Night recognition. However, deep learning datasets for tea shoots face challenges beyond labelling complexity and high costs. The most significant issue lies in the substantial phenotypic variations of tea shoots—even within the same cultivar—due to environmental factors and growth stages. Additionally, distinct morphological differences between cultivars render pre-trained models ineffective for novel varieties, necessitating multi-cultivar datasets. For example, Yu et al. [7] developed a dataset containing Longjing 43, Zhongcha 108, and Cuifeng cultivars to fine-tune detection models, achieving improved multi-cultivar generalization. Li et al. [8] expanded this work by including autumn LJ43 samples, enhancing model robustness across seasons.
While large-scale multi-cultivar datasets enable discrimination between varieties and seasons, they remain insufficient for cross-Day–Night applications and involve prohibitive time and cost. To address this, researchers have applied data augmentation techniques to expand limited tea shoot datasets. Conventional unsupervised methods include basic transformations like brightness, size, contrast, and rotation adjustments. For instance, Xiaoming Wang et al. [9] augmented datasets via rotation and mirroring, while Wenkai Xu et al. [10] expanded 819 original images to 1400 using 90°/180° rotations and horizontal/vertical flips in 2022. More advanced approaches like Gamut transformation [11], Mosaic [12], Mixup [13], and Cutmix [14] have also been adopted. In 2023, Zhiyong Gui et al. [15] augmented a 1000-image dataset (1600 × 1200 pixels) using Random, Mixup, Mosaic, and HSV transformations. Yanxu Wu et al. [16] incorporated depth maps from RGBD cameras to enhance feature diversity.
However, these methods rely on real-data transformations or specialised data collection, often resulting in underfitting for minority classes. To overcome this, agricultural researchers have explored generative adversarial networks (GANs) [17] for data augmentation. Notable examples include Texture Reconstruction Loss CycleGAN (TRL-GAN) [18,19] for citrus disease detection, GAN-based weed image generation [20], spectrogram GAN for wheat kernel classification [21], Jujube-GAN with self-attention mechanisms [22], LeafGAN for leaf segmentation [23], SugarcaneGAN for disease dataset generation [24,25], SwGAN for rice segmentation [26], GANs for generating artificial images of plant seedlings [27], GAN-integrated transfer learning frameworks for weed identification [20], regression-conditional diffusion models for plant disease graded image generation [28], GANs for garlic apex orientation dataset augmentation [29], C-GAN for tomato disease image synthesis [30], AgI-GAN for high-resolution specialty crop monitoring [31], and GANs for visual-to-near-infrared image translation in precision agriculture [32].
These applications demonstrate GANs’ potential in agricultural data augmentation, yet key limitations remain in addressing the complex variability specific to tea shoot detection. For instance, TRL-GAN [18] focuses on cross-sensor domain transfer but lacks mechanisms for multi-scale feature fusion, and its relatively shallow discriminator limits the ability to capture fine-grained phenotypic variations (e.g., subtle differences in tea leaf morphology). SwGAN [26], while incorporating self-attention for global dependency modeling, does not address the critical issue of “local–global consistency” in agricultural scenes, often leading to mismatches between local textures (e.g., leaf veins) and global lighting conditions. LeafGAN [23], designed primarily for single leaf morphology augmentation, is not equipped to handle joint domain transfer across different cultivars (phenotypic differences) and varying lighting conditions (e.g., day vs. night). Additionally, GANs for plant seedlings [27] struggle to reproduce small shape details (e.g., fine leaf edges), which are crucial for distinguishing tea shoot maturity stages. The GAN-weed identification framework [20], though effective for binary classification, fails to handle multi-object scenarios common in dense tea canopies. Regression-conditional diffusion models [28] prioritize disease grading over cross-cultivar or cross-lighting transfer, limiting their applicability to tea shoot diversity. Garlic apex GANs [29] and tomato C-GAN [30] focus on single-species augmentation, lacking adaptability to the phenotypic variations between Longjing 43 and Zhongcha 108. AgI-GAN [31] and visual-to-NIR GANs [32], while advancing resolution enhancement and spectral translation, overlook spatial consistency between local shoot textures and global backgrounds, which is essential for maintaining realism in tea garden images. Thus, few existing agricultural GANs integrate solutions for both multi-scale feature alignment and spatial structural preservation, which are essential for maintaining the realism of augmented tea shoot images in complex field environments. Despite these advancements, three critical challenges remain in cross-Day–Night tea shoot detection:
(1)
Complex Backgrounds: Traditional single-domain GANs struggle with open-air tea garden backgrounds. While CycleGAN shows promise in style transfer, its multi-scale feature extraction and discriminator capabilities require enhancement.
(2)
Multi-Objective Preservation: Existing agricultural GANs focus on single-object (e.g., diseased leaves) style transfer. Tea shoot detection demands preserving both complex backgrounds and multi-object bounding boxes, an understudied area.
(3)
Cultivar-Specific Generalization: No systematic method exists for cross-cultivar style transfer to address cultivar-specific model limitations.
To tackle these challenges, this study introduces three innovations:
(1)
Tea CycleGAN Architecture: Incorporates SKNet modules into the generator for multi-scale feature fusion and enhances discriminator depth to improve complex scene parsing.
(2)
Hierarchical Style Transfer: Implements global–local collaborative training to preserve multi-scale features, generating high-fidelity synthetic datasets.
(3)
Cross-Day–Night and Cross-Variety Framework: Develops a systematic data augmentation pipeline validated against state-of-the-art methods to support cross-Day–Night detection.
Therefore, the primary objective of this study is to develop and validate a novel GAN-based framework specifically designed to overcome the critical challenges of cross-Day–Night and cross-cultivar tea shoot detection. We hypothesize that:
(1)
The proposed Tea CycleGAN architecture, with enhanced multi-scale feature fusion (via SKNet) and a deeper discriminator, will significantly improve the generation of realistic synthetic tea shoot images under both day and night conditions, particularly in complex open-air backgrounds compared to traditional methods;
(2)
The hierarchical style transfer strategy will effectively preserve both the global scene context and local object details (including bounding box integrity for multi-shot scenarios) during domain adaptation;
(3)
The systematic cross-Day–Night and cross-variety data augmentation pipeline will demonstrably enhance the generalization capability and detection accuracy of deep learning models for previously unseen cultivars and lighting conditions, surpassing the performance achievable with conventional augmentation methods and existing agricultural GANs.

The remainder of this paper is structured as follows: Section 2 details the Tea CycleGAN architecture and hierarchical transfer method; Section 3 presents ablation studies and comparative analyses; Section 4 discusses the findings and limitations; Section 5 concludes the work and outlines future directions in dynamic domain adaptation.

2. Materials and Methods

2.1. Overall Research Plan

This section outlines the overall research framework, clarifying the logical connections between key stages via a workflow integrating cross-domain synthesis and tea shoot detection. The framework comprises five interrelated phases:
  • Data collection and preprocessing: Raw images of two tea cultivars (Longjing 43 and Zhongcha 108) were collected under daytime and nighttime conditions. Tea shoot positions were annotated, and datasets were split into training/validation sets to support subsequent model training.
  • Tea CycleGAN design: An improved CycleGAN architecture was developed, integrating SKConv for multi-scale feature fusion and a “restoration-paste” strategy to preserve spatial consistency. Loss functions (cycle consistency, adversarial loss) and training strategies (learning rate scheduling, gradient clipping) were tailored for agricultural scenarios.
  • Cross-domain augmentation: The trained Tea CycleGAN generated augmented images across domains: cross-cultivar (e.g., Longjing 43→Zhongcha 108 style) and cross-Day–Night (e.g., daytime→nighttime style), expanding the dataset to cover underrepresented scenarios.
  • Performance validation: Synthesis quality was evaluated via FID (distribution similarity) and MMD (feature alignment). Downstream detection performance was assessed using YOLOv7, with metrics including mAP, Precision, and Recall, to verify if augmented data enhances model generalization.
  • Comparative analysis: Tea CycleGAN was compared with traditional augmentation methods (Mosaic, Mixup) and state-of-the-art GANs (CycleGAN, DCGAN) in terms of synthesis quality and detection improvement, validating its superiority.
This framework ensures a systematic workflow, where each phase builds on the previous one to achieve the goal of enhancing tea shoot detection via cross-domain synthesis.

2.2. Data Collection

A total of 2000 images (1000 each for the Longjing 43 and Zhongcha 108 cultivars) were acquired at the Tea Research Institute of the Chinese Academy of Agricultural Sciences using a ZIVID Two industrial RGB-D camera, manufactured by ZIVID, a Norwegian company specializing in 3D machine vision solutions [33]. All samples were captured at 1920 × 1080 pixels in late March of spring 2023 and cropped to 600 × 600 pixels. Images were collected on clear, rain-free days at a 45° downward angle, approximately 0.5 m from the tea plants. As shown in Figure 1, Longjing 43 samples were captured under natural daylight conditions to represent typical daytime scenarios, while Zhongcha 108 images were obtained at night with integrated camera-based supplementary lighting to simulate low-light environments. All images were manually annotated using LabelImg 1.8.6 software to generate PASCAL VOC-formatted datasets containing bounding boxes for tea shoots (Table 1).
Morphological analysis revealed significant phenotypic variations between cultivars under controlled lighting conditions (Figure 2). Longjing 43 plants displayed semi-spreading canopies with elliptical, dark green, glossy leaves and short, robust shoots characterized by moderate pubescence. In contrast, Zhongcha 108 exhibited semi-spreading canopies with long-elliptical green leaves featuring slightly raised surfaces and yellow-green shoots with sparse pubescence. These cultivar-specific traits were further amplified by lighting differences, necessitating simultaneous style transfer of both background and tea shoot regions to maintain spatial consistency during data augmentation.

2.3. Model

2.3.1. CycleGAN

CycleGAN (Cycle-Consistent Generative Adversarial Network), proposed by Jun-Yan Zhu et al. in 2017 [19], represents a significant advancement in unsupervised image-to-image translation. This model employs a dual-generator–dual-discriminator architecture where generator G_AB learns to translate images from domain A (e.g., daytime Longjing 43) to domain B (e.g., nighttime Zhongcha 108), while generator G_BA performs the inverse mapping from B to A. Two discriminators, D_A and D_B, are trained to distinguish real from fake images in their respective domains, enabling unsupervised learning through a cycle-consistency loss. This loss function ensures that translating an image from A→B→A (or B→A→B) reconstructs the original image, preserving essential semantic information.
Originally designed for style transfer, seasonal conversion, and object category translation, CycleGAN has been adapted in this study to address agricultural data augmentation challenges. By treating daytime Longjing 43 and nighttime Zhongcha 108 as distinct domains (Figure 3), the model generates synthetic images that retain cultivar-specific phenotypic features while altering lighting conditions. This approach effectively expands the dataset for cross-Day–Night detection tasks by creating realistic variations that bridge the gap between diurnal and nocturnal imaging scenarios.
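As a minimal illustration of the loss structure described above, the following PyTorch sketch computes the generator-side objective from the adversarial and cycle-consistency terms. The module names (G_AB, G_BA, D_A, D_B) are placeholders for instantiated generators and discriminators; the least-squares adversarial form and the cycle weight of 10 follow the training setup reported in Section 2.6.2, and the identity term used there is omitted for brevity.

import torch
import torch.nn as nn

# G_AB, G_BA, D_A, D_B are assumed to be instantiated generator/discriminator modules.
adv_loss = nn.MSELoss()   # least-squares adversarial loss
cyc_loss = nn.L1Loss()    # cycle-consistency loss

def generator_objective(real_A, real_B, G_AB, G_BA, D_A, D_B, lambda_cyc=10.0):
    fake_B, fake_A = G_AB(real_A), G_BA(real_B)      # A -> B and B -> A translations
    rec_A, rec_B = G_BA(fake_B), G_AB(fake_A)        # A -> B -> A and B -> A -> B cycles
    pred_B, pred_A = D_B(fake_B), D_A(fake_A)
    # Generators are rewarded when the discriminators score their fakes as real (label 1)
    loss_gan = adv_loss(pred_B, torch.ones_like(pred_B)) + adv_loss(pred_A, torch.ones_like(pred_A))
    # Reconstructed images should match the originals, preserving semantic content
    loss_cycle = cyc_loss(rec_A, real_A) + cyc_loss(rec_B, real_B)
    return loss_gan + lambda_cyc * loss_cycle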

2.3.2. Generator Combined with SkNet Attention Mechanism

In image generation, generator design plays a critical role in determining output quality and detail preservation. This study introduces a novel generator architecture incorporating Selective Kernel Convolution (SKConv) [34] and a specialized residual module to enhance detail and texture processing capabilities, particularly in tea shoot image generation.
The SKConv module, a key component of the generator, employs a multi-branch convolutional structure to capture multi-scale features while dynamically selecting optimal representations through attention mechanisms (Figure 4). Specifically, the module first reduces input feature map dimensions via a 1 × 1 convolution layer. Three parallel branches with 3 × 3, 5 × 5, and 7 × 7 convolutional kernels then extract multi-scale features. These features are concatenated and processed through two 1 × 1 convolution layers to generate attention vectors, which dynamically weight branch outputs for final feature map synthesis. This design enables adaptive multi-scale feature selection, improving the generator’s ability to handle tea shoot texture details and overall structural variations.
Each convolutional branch in SKConv includes instance normalization and ReLU activation to stabilize feature extraction and introduce non-linearity. The attention mechanism computes channel-wise weights by globally pooling spatial dimensions, followed by two fully connected layers with a reduction ratio of 16. The first layer uses ReLU activation to introduce sparsity, while the second applies a sigmoid function to scale weights between 0 and 1.
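The following PyTorch sketch gives one plausible reading of the SKConv block described above: three 3 × 3/5 × 5/7 × 7 branches with instance normalization and ReLU, global pooling, and a two-layer attention head with reduction ratio 16 (ReLU then sigmoid) that weights the branch outputs. The initial 1 × 1 dimension-reduction step and exact channel widths are omitted here, and the sigmoid-based weighting follows the text rather than the softmax used in the original SKNet.

import torch
import torch.nn as nn

class SKConvSketch(nn.Module):
    """Selective-kernel block: multi-scale branches fused by channel attention."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2),
                nn.InstanceNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for k in (3, 5, 7)
        ])
        hidden = max(channels // reduction, 4)        # reduction ratio 16
        self.pool = nn.AdaptiveAvgPool2d(1)           # global spatial pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),                    # sparsity in the first layer
            nn.Linear(hidden, channels * 3),          # one weight vector per branch
        )

    def forward(self, x):
        feats = torch.stack([branch(x) for branch in self.branches], dim=1)  # (B, 3, C, H, W)
        fused = feats.sum(dim=1)                                             # aggregate branches
        b, c = x.shape[0], x.shape[1]
        attn = torch.sigmoid(self.fc(self.pool(fused).flatten(1)))           # weights in [0, 1]
        attn = attn.view(b, 3, c, 1, 1)
        return (feats * attn).sum(dim=1)                                     # weighted fusion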
To further enhance feature learning, a residual module integrating SKConv and skip connections is proposed (Figure 5). This module maintains input information integrity while augmenting multi-scale representation through sequential operations: initial channel adjustment via 1 × 1 convolution, multi-scale feature extraction with dynamic attention weighting, and residual connection combining original and processed features. This design improves training stability by facilitating gradient flow and preserving low-level details critical for tea shoot detection.
The residual block structure ensures that the generator retains both high-frequency details (e.g., leaf edges and vein patterns) and low-frequency contextual information. Specifically, the input tensor first undergoes a 1 × 1 convolution to match the channel dimension of the SKConv output. After passing through the SKConv module, the processed features are normalized using instance normalization and activated via ReLU before being added back to the original input.
Given significant size variations between whole tea plants and individual shoots (Figure 6), the generator architecture is optimized with scalable SKNet structures. The modified generator consists of input and output layers, downsampling/upsampling blocks, and residual blocks containing 1–9 SKConv units. This hierarchical configuration ensures robust feature extraction across multiple spatial scales, enabling effective handling of both macro-level background contexts and micro-level shoot details in agricultural imaging scenarios.
The generator follows a U-Net-like architecture with symmetric downsampling and upsampling paths. The downsampling blocks use stride-2 convolutions to reduce spatial dimensions while increasing channel counts (64→128→256), capturing hierarchical features. Upsampling blocks employ transposed convolutions with stride-2 to restore spatial resolution (256→128→64). Between these blocks, 3–9 residual blocks with SKConv modules are stacked to refine multi-scale features, with the exact number determined by input image size and complexity.
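Combining the pieces above, a compact sketch of the SKConv residual block and the generator skeleton might look as follows, reusing SKConvSketch from the previous sketch. The 7 × 7 stem and output convolutions and the Tanh output activation are standard CycleGAN-style assumptions not stated in the text; the stride-2 down/upsampling path (64→128→256→128→64) and the stack of 3–9 residual blocks follow the description above.

import torch.nn as nn

class SKResidualBlock(nn.Module):
    """1 x 1 channel adjustment, SKConv with attention, then a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),    # channel adjustment
            SKConvSketch(channels),                          # multi-scale feature extraction
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)   # residual connection preserves low-level detail

class TeaGeneratorSketch(nn.Module):
    """Generator skeleton: stride-2 downsampling, SK residual blocks, upsampling."""

    def __init__(self, in_channels=3, n_blocks=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 7, padding=3), nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(inplace=True),
            *[SKResidualBlock(256) for _ in range(n_blocks)],            # 3-9 blocks in practice
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, in_channels, 7, padding=3), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)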

2.3.3. Enhanced Discriminator Architecture

The original CycleGAN discriminator employs a PatchGAN architecture that outputs a fixed-size matrix to evaluate local image patch authenticity. Comprising stacked convolutional layers with incrementally increasing channels, this structure uses Leaky ReLU activation and batch normalization to enhance representational capacity and training stability. However, its relatively shallow architecture may struggle to capture high-level features and subtle textures in complex agricultural imagery.
To address these limitations, an improved discriminator design is proposed (Figure 7). The enhanced architecture deepens the network by adding convolutional layers while maintaining PatchGAN’s patch-based evaluation mechanism. Specifically, the modified discriminator consists of five convolutional layers with sequentially increasing channel depths: 64, 128, 256, 512, and 1. Batch normalization layers are incorporated after the second, third, and fourth convolutional layers to mitigate internal covariate shift and accelerate convergence. Leaky ReLU activation functions are retained throughout to preserve gradient flow in negative regions.
The discriminator’s convolutional layers use 4 × 4 kernels with stride-2 for the first three layers (downsampling) and stride-1 for the last two layers (feature refinement). This configuration creates a receptive field of approximately 70 × 70 pixels, enabling fine-grained texture analysis. The final convolution produces a single-channel output map, which is interpreted as the probability of patch authenticity through a sigmoid function during training.
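The layer configuration just described maps directly onto a small PyTorch module; this is a sketch in which the LeakyReLU slope of 0.2 and the padding of 1 are common PatchGAN defaults assumed here rather than values given in the text.

import torch.nn as nn

class TeaDiscriminatorSketch(nn.Module):
    """Five 4 x 4 convolutions (64-128-256-512-1): stride 2 for the first three,
    stride 1 for the last two, batch normalization after layers 2-4, LeakyReLU."""

    def __init__(self, in_channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),   # single-channel patch authenticity map
        )

    def forward(self, x):
        return self.model(x)   # a sigmoid (or least-squares loss) is applied during training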
This deepened discriminator offers multiple advantages. Deeper network architecture enables the capture of more abstract and complex features critical for distinguishing subtle differences between real and synthetic tea images, including texture details, color distributions, and morphological characteristics. Batch normalization stabilizes training dynamics by normalizing activation statistics across layers, while Leaky ReLU ensures non-vanishing gradients in negative regions. Collectively, these modifications improve discriminator sensitivity to cultivar-specific traits and lighting variations.
The discriminator’s increased depth allows it to model hierarchical feature representations: early layers capture low-level edges and textures, middle layers encode mid-level patterns (e.g., leaf shapes), and late layers integrate high-level semantic information (e.g., shoot arrangement). This hierarchical processing is particularly effective for tea images, where authenticity depends on both micro-scale details (e.g., trichome density) and macro-scale structures (e.g., branching patterns).
In tea image generation tasks, the enhanced discriminator significantly improves adversarial training dynamics. By enforcing stricter authenticity evaluation, it drives the generator to produce higher-fidelity synthetic images with realistic textures, natural color transitions, and accurate morphological representations. This is particularly evident in cross-domain style transfer scenarios, where the discriminator effectively discriminates between cultivar-specific shoot features under different lighting conditions.

2.4. Multi-Scale Style Transfer-Based Data Augmentation Framework

Building upon the enhanced CycleGAN architecture, this study introduces a hierarchical data augmentation framework (Figure 8) designed to address agricultural imaging challenges through multi-scale style transfer. The proposed method trains the modified CycleGAN on paired daytime Longjing 43 and nighttime Zhongcha 108 datasets to learn bidirectional domain mappings that preserve both macro-level background contexts and micro-level shoot details.
The workflow involves dual-scale training where the generator processes both full-resolution images (600 × 600 pixels) and localized tea shoot regions (64 × 64 pixels) extracted from bounding boxes. This approach ensures the model captures both global canopy structures and fine-scale shoot features under different lighting conditions. After style transfer, synthetic shoot images are seamlessly integrated into their original background positions using a restoration-paste algorithm, maintaining spatial consistency while enhancing phenotypic traits through adversarial training.
Pseudocode for the restoration-paste strategy:
import numpy as np
import cv2


def create_gaussian_mask(shape, kernel_size=3, sigma=0.5):
    # One plausible implementation: ones in the interior, zeros on the border,
    # blurred so the pasted shoot fades smoothly into the background.
    h, w = shape
    mask = np.zeros((h, w), dtype=np.float32)
    mask[1:-1, 1:-1] = 1.0
    mask = cv2.GaussianBlur(mask, (kernel_size, kernel_size), sigma)
    return mask[..., None]                                   # channel axis for RGB broadcasting


def restoration_paste(original_image, bounding_boxes, generator_full, generator_shoot):
    """original_image: 600 x 600 RGB array; bounding_boxes: list of (x, y, w, h);
    generator_full / generator_shoot: trained style-transfer generators that map
    image arrays to image arrays. Returns the augmented 600 x 600 image."""
    # 1. Extract real shoots and record coordinates
    real_shoots, shoot_coords = [], []
    for (x, y, w, h) in bounding_boxes:
        real_shoots.append(original_image[y:y + h, x:x + w])  # crop shoot from original image
        shoot_coords.append((x, y, w, h))

    # 2. Generate style-transferred background and shoots
    transferred_background = generator_full(original_image)   # global style transfer
    transferred_shoots = []
    for shoot in real_shoots:
        resized_shoot = cv2.resize(shoot, (64, 64))            # resize to model input size
        transferred_shoots.append(generator_shoot(resized_shoot))

    # 3. Paste transferred shoots back to their original positions with edge blending
    augmented_image = transferred_background.copy()
    for shoot_64, (x, y, w, h) in zip(transferred_shoots, shoot_coords):
        transferred_shoot = cv2.resize(shoot_64, (w, h))        # back to bounding-box size
        mask = create_gaussian_mask((h, w), kernel_size=3, sigma=0.5)
        region = augmented_image[y:y + h, x:x + w].astype(np.float32)
        blended = transferred_shoot.astype(np.float32) * mask + region * (1 - mask)
        augmented_image[y:y + h, x:x + w] = np.clip(blended, 0, 255).astype(augmented_image.dtype)

    return augmented_image
This hierarchical strategy generates synthetic images with improved texture fidelity and morphological accuracy by explicitly modeling both global and local features. Separate model weights are dynamically applied during augmentation to produce context-aware synthetic data compatible with existing annotation workflows. Experimental validation demonstrates that this method improves detection performance by 9.6 percentage points in mean average precision over the unaugmented baseline, exceeding state-of-the-art augmentation techniques and highlighting its effectiveness for agricultural computer vision tasks.

2.5. Classical Object Detection Network

This study employs YOLOv7 [35], a state-of-the-art object detection framework, as the test model to evaluate the effectiveness of different data augmentation methods on tea shoot detection tasks. YOLOv7 achieves superior performance through innovations including model re-parameterization, optimized label assignment strategies, efficient network architecture design, and integration of SAT, SPP, and CmBN technologies. These advancements enable YOLOv7 to deliver both high detection accuracy and fast inference speed, making it suitable for real-world agricultural applications. The model serves as a benchmark for comparing the performance gains of the proposed method against traditional augmentation techniques in deep learning-based detection tasks.

2.6. Experimental Setup

2.6.1. Training Environment

All experiments were conducted on a computer system featuring an Intel i7-12700F CPU, NVIDIA GeForce RTX 4060 12 GB GPU, and 32 GB RAM. The Ubuntu 18.04 LTS operating system was used with Python 3.8 and PyTorch 1.11 to establish the deep learning environment.

2.6.2. Training Parameters

Adversarial network training utilized 600 × 600-pixel full images and 64 × 64-pixel localized tea shoot regions as input, maintaining RGB color channels and a batch size of 16. The training process spanned 1000 epochs with a noise dimension of 100 and a sampling interval of 100 iterations. For the GAN models (Tea CycleGAN and variants), the optimization setup included the Adam optimizer with a learning rate of 0.0002, betas set to (0.5, 0.999), and weight decay of 0.0001, where generators (G_A2B and G_B2A) were optimized jointly and discriminators (D_A and D_B) were optimized separately. A StepLR scheduler with a step size of 100 epochs and a decay factor of 0.5 was employed to reduce the learning rate by half every 100 epochs, and gradient norms of generators and discriminators were clipped to a maximum value of 0.5 during backpropagation. The loss function for GAN training combined adversarial loss (MSELoss with weight 1.0), cycle consistency loss (L1Loss with total weight 10.0 for bidirectional transfer), and identity loss (L1Loss with total weight 5.0 for intra-domain preservation). For YOLOv7 detection model training, the same batch size and a 1000-epoch configuration were applied without architectural modifications or pre-trained weights.
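For reference, the optimizer, scheduler, loss-weight, and gradient-clipping settings listed above can be wired together as in the following sketch; G_A2B, G_B2A, D_A, and D_B stand for already-instantiated models, and the per-step update loops are omitted.

import itertools
import torch
import torch.nn as nn

# Generators are optimized jointly, discriminators separately (Adam, lr 0.0002,
# betas (0.5, 0.999), weight decay 0.0001), as listed above.
opt_G = torch.optim.Adam(itertools.chain(G_A2B.parameters(), G_B2A.parameters()),
                         lr=2e-4, betas=(0.5, 0.999), weight_decay=1e-4)
opt_D_A = torch.optim.Adam(D_A.parameters(), lr=2e-4, betas=(0.5, 0.999), weight_decay=1e-4)
opt_D_B = torch.optim.Adam(D_B.parameters(), lr=2e-4, betas=(0.5, 0.999), weight_decay=1e-4)

# Halve the learning rate every 100 epochs.
schedulers = [torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.5)
              for opt in (opt_G, opt_D_A, opt_D_B)]

# Loss terms and weights: adversarial (MSE, 1.0), cycle consistency (L1, 10.0), identity (L1, 5.0).
adv_loss, l1_loss = nn.MSELoss(), nn.L1Loss()
LAMBDA_ADV, LAMBDA_CYC, LAMBDA_ID = 1.0, 10.0, 5.0

# After each backward pass, gradient norms are clipped to 0.5 before the optimizer step, e.g.:
# torch.nn.utils.clip_grad_norm_(itertools.chain(G_A2B.parameters(), G_B2A.parameters()), max_norm=0.5)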

2.6.3. Evaluation Metrics

Frechet Inception Distance (FID) [36] was computed using Inception-v3 features to measure distributional similarity between real and synthetic images. The FID formula quantifies the squared difference between feature means and covariance matrix divergence.
\mathrm{FID} = \lVert \mu_g - \mu_r \rVert^2 + \mathrm{Tr}\left( \Sigma_g + \Sigma_r - 2\left( \Sigma_g \Sigma_r \right)^{1/2} \right)
where μ_g and μ_r denote the feature means of generated and real images, and Σ_g and Σ_r represent their covariance matrices. The lower the FID value, the more similar the distribution of the generated images is to that of the real images.
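For clarity, the FID computation from Inception-v3 feature matrices can be sketched as below; the use of scipy.linalg.sqrtm for the matrix square root is an implementation choice rather than something specified in the text.

import numpy as np
from scipy import linalg

def fid_score(feat_real, feat_gen):
    """FID from feature matrices (rows = images, columns = Inception-v3 features)."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_g @ sigma_r)          # matrix square root of the product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                          # drop negligible imaginary parts
    return float(np.sum((mu_g - mu_r) ** 2) + np.trace(sigma_g + sigma_r - 2.0 * covmean))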
Maximum Mean Discrepancy (MMD) was employed to further quantify the distributional alignment between generated and real images, focusing on measuring the distance between two probability distributions in the Reproducing Kernel Hilbert Space (RKHS). For two sets of samples {x_i} (generated images, i = 1, …, n) and {y_j} (real images, j = 1, …, m) with features extracted by Inception-v3, MMD is defined as:
\mathrm{MMD} = \left\lVert \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i) - \frac{1}{m} \sum_{j=1}^{m} \varphi(y_j) \right\rVert_{\mathcal{H}}
where φ(·) denotes the mapping function induced by a Gaussian kernel, k(x, y) = exp(−‖x − y‖²/(2σ²)), with σ as the kernel bandwidth, and ‖·‖_H represents the norm in the RKHS. A smaller MMD value indicates a closer alignment between the feature distributions of generated and real images, complementing FID by emphasizing local structural similarities in feature spaces.

t-Distributed Stochastic Neighbor Embedding (t-SNE) was employed to visually validate cross-domain feature distribution alignment, complementing FID and MMD with qualitative and quantitative insights. Two key metrics derived from t-SNE embeddings are defined as follows:
Mean sigma: reflects the compactness of clusters in the low-dimensional embedding, calculated as the average bandwidth of the Gaussian kernels used to model pairwise similarities in the high-dimensional feature space. For N samples, the mean sigma is:
\text{Mean sigma} = \frac{1}{N} \sum_{i=1}^{N} \sigma_i
where σ_i denotes the kernel bandwidth for sample i, determined by the perplexity parameter to balance local and global neighborhood relationships. Smaller values indicate tighter clustering and better cross-domain alignment.
KL divergence: measures the discrepancy between the high-dimensional feature distribution (P) and its low-dimensional embedding distribution (Q):
\mathrm{KL}(P \parallel Q) = \sum_{i<j} P_{ij} \log\left( \frac{P_{ij}}{Q_{ij}} \right)
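Returning to the MMD metric defined above, a minimal NumPy sketch of the kernel estimate is shown below; the Gaussian bandwidth value is illustrative, since the σ used in the experiments is not reported.

import numpy as np

def gaussian_mmd(feat_gen, feat_real, sigma=10.0):
    """Biased MMD estimate with kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    def kernel(a, b):
        sq_dists = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-sq_dists / (2.0 * sigma ** 2))
    k_gg = kernel(feat_gen, feat_gen).mean()
    k_rr = kernel(feat_real, feat_real).mean()
    k_gr = kernel(feat_gen, feat_real).mean()
    return float(np.sqrt(max(k_gg + k_rr - 2.0 * k_gr, 0.0)))   # norm in the RKHS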
Object detection performance was evaluated using precision (P), recall (R), and average precision (AP), as Equations (5)–(7):
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
AP = \int_{0}^{1} P(R)\,\mathrm{d}R
Here, TP/FP/FN represent true positives (correct positive detections), false positives (incorrect positive predictions), and false negatives (undetected true positives), respectively. Precision measures the proportion of correct positive predictions among all positive detections, while recall quantifies the proportion of true positives successfully identified. Average precision (AP) integrates precision over all possible recall thresholds, providing a comprehensive measure of detection performance.

3. Results and Discussion

3.1. Comparative Evaluation of GANs

To validate both the robustness of experimental results and the improvements of Tea CycleGAN over CycleGAN and other GAN variants [37,38], comparative experiments were conducted with five independent repetitions under different random initializations. Statistical rigor was ensured using metrics including mean ± standard deviation, 95% confidence intervals, and two-tailed independent samples t-tests (significance level: p < 0.05) to assess the significance of performance differences. Evaluations focused on two key metrics: Fréchet Inception Distance (FID) and Maximum Mean Discrepancy (MMD), applied to both 600 × 600-pixel full images and 64 × 64-pixel localized tea shoots regions. FID quantifies overall distributional similarity between real and synthetic images, while MMD measures alignment of feature distributions in the Reproducing Kernel Hilbert Space—both metrics use lower values to indicate better fidelity, with MMD particularly sensitive to local structural consistency in feature spaces. This comprehensive framework ensures that observed performance differences are both statistically significant and meaningful for assessing generative quality. The experimental results are presented in Table 2 and Table 3.
As demonstrated in Table 2 and Table 3, cyclic generative adversarial networks consistently outperformed single-domain GANs across both scales. Tea CycleGAN achieved FID scores of 42.26 ± 1.43 (95% CI: [41.08, 43.44]) for full images and 26.75 ± 1.03 (95% CI: [25.95, 27.55]) for tea shoots regions, representing 43.94% and 53.48% reductions, respectively, compared to the original CycleGAN (75.38 ± 2.56 and 57.51 ± 1.94, respectively).
Similarly, in MMD—which emphasizes feature distribution alignment—Tea CycleGAN achieved 0.02241 ± 0.0015 (full images, 95% CI: [0.0212, 0.0236]) and 0.02452 ± 0.0011 (tea shoots regions, 95% CI: [0.0237, 0.0253]), outperforming all variants: 31.05% lower than CycleGAN + SKConv (0.03263 ± 0.0021) and 63.65% lower than the original CycleGAN (0.06137 ± 0.0032) for full images; 3.04% and 13.91% lower for tea shoots regions, respectively.
Notably, Tea CycleGAN exhibited the smallest standard deviations and narrowest confidence intervals across both metrics, indicating superior stability across five repeated experiments, which is critical for reliable agricultural image synthesis. In contrast, single-domain GANs like WGAN showed larger variability (e.g., 600 × 600 FID: 254.76 ± 4.53, 95% CI: [251.01, 258.51]), reflecting inconsistent performance. Even incremental improvements yielded measurable gains in MMD: CycleGAN + SKConv reduced MMD by 47.23% (full images) and 10.99% (tea shoots regions) compared to the original CycleGAN, while CycleGAN + Improved Discriminator showed 3.16% and 10.74% reductions. These results confirm that both multi-scale feature fusion (via SKConv) and enhanced domain discrimination (via deeper discriminators) contribute to tighter alignment, with their combined effect in Tea CycleGAN yielding the most significant and stable improvements. The superior MMD performance of Tea CycleGAN stems from its ability to preserve both global structure and local phenotypic features: SKConv dynamically fuses multi-scale traits (e.g., leaf veins, shoot texture), while the deepened discriminator minimizes subtle distribution gaps.
Collectively, FID and MMD results demonstrate that Tea CycleGAN not only improves visual fidelity but achieves tighter, more stable distributional alignment in feature space, validating the synergistic effect of its architectural innovations for agricultural image synthesis. Visual analysis of Table 4 reveals that cyclic architecture preserves critical details (e.g., leaf gloss, color transitions) versus blurry outputs from single-domain GANs, with Tea CycleGAN uniquely capturing nuanced cultivar-specific color variations.
At the 64 × 64 scale (Table 5), all cyclic models maintain acceptable performance, with Tea CycleGAN producing sharper edges and more defined textures compared to other variants. These improvements highlight the robustness of the proposed architecture across different spatial scales.
Figure 9 compares the training loss curves of CycleGAN and Tea CycleGAN.

3.2. Influence of Shoot Placement Methods on Virtual Dataset Effectiveness

In this experiment investigating the impact of shoot placement methods on YOLOv7 detection performance in large tea plant backgrounds, three methods were compared: restoration-paste, fully random pasting, and probability-based pasting constrained by dataset statistics. Analysis of the Longjing 43 dataset revealed key bounding box distribution patterns including size, position, quantity, and aspect ratio (Figure 10). Using these constraints, synthetic shoots were generated with a base width of 40 pixels, width variations between 0.2–1.8, aspect ratios of 2 ± 0.5, and 0–40 instances per image.
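As an illustration of how the statistics above constrain probability-based pasting, the sketch below samples candidate bounding boxes; treating the 2 ± 0.5 aspect ratio as height/width and the choice of uniform/normal distributions are assumptions made here for illustration.

import numpy as np

def sample_paste_boxes(img_size=600, rng=None):
    """Sample (x, y, w, h) boxes: base width 40 px scaled by 0.2-1.8,
    aspect ratio 2 +/- 0.5, and 0-40 shoot instances per image."""
    rng = rng or np.random.default_rng()
    boxes = []
    for _ in range(int(rng.integers(0, 41))):          # 0-40 instances per image
        w = max(int(40 * rng.uniform(0.2, 1.8)), 1)    # width around the 40 px base
        h = max(int(w * rng.normal(2.0, 0.5)), 1)      # assumed height/width ratio 2 +/- 0.5
        x = int(rng.integers(0, max(img_size - w, 1)))
        y = int(rng.integers(0, max(img_size - h, 1)))
        boxes.append((x, y, w, h))
    return boxes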
The integration of stylized shoots into background images represents a critical step in agricultural data augmentation. Different implementation methods significantly affect detection performance. This study evaluated three methods—fully random pasting, probability-based pasting, and restoration-paste—using YOLOv7’s mean average precision (mAP) as the primary metric (Table 6).
Restoration-paste demonstrated significant advantages by preserving original spatial relationships between shoots and backgrounds, achieving high consistency with real scenarios. This consistency provided accurate training data, enabling better feature learning, thus improving YOLOv7 detection performance. Fully random pasting performed poorly due to ignoring spatial relationships, creating synthetic inconsistencies that hindered feature learning. Probability-based pasting showed intermediate results by incorporating dataset statistics while still retaining residual randomness affecting performance (Table 7).
These results confirm restoration-paste as the optimal method for maintaining spatial integrity during data augmentation. Its preservation of original relationships and consistency with real scenarios provides strong support for YOLOv7 performance improvements. This experimental outcome offers a critical reference for subsequent data processing and model training to optimize practical application performance.

3.3. Comparative Testing of Data Augmentation Methods

In this experiment, the daytime LJ43 dataset is treated as an unfamiliar (previously unseen) dataset, while the nighttime Zhongcha 108 dataset serves as a mature existing dataset; the proposed data augmentation method is applied as an additional data source for daytime LJ43 to verify the effectiveness of cross-Day–Night and cross-variety data augmentation.
Current data augmentation methods mainly include Mixup and Mosaic. To validate the effect of the proposed multi-scale style transfer data augmentation method, multiple datasets were created using common data augmentation methods as control datasets for this method. Each dataset consists of 500 real LJ43 images and 500 augmented images generated from real data. Meanwhile, an unaugmented dataset was also used as a control group. Using YOLO v7 as the object detection test model, the experimental results in Table 8 were obtained.
Comparison experiments based on the YOLOv7 model show that the proposed multi-scale data augmentation method significantly improves object detection performance. Compared with the unaugmented data (mAP = 73.94%, P = 75.25%, R = 45.25%), the proposed method improved the three indicators to 83.54%, 86.03%, and 55.18% (+9.60, +10.78, and +9.93 percentage points, respectively), with gains exceeding Mosaic (mAP = 80.13%, P = 83.52%, R = 50.92%) and Mixup (mAP = 78.93%, P = 81.49%, R = 48.74%). This result validates the dual advantages of Tea CycleGAN and multi-scale style transfer data augmentation.
On the one hand, the SKNet-based generator in Tea CycleGAN dynamically fuses target features at different scales, enhancing the model’s adaptability to size variations, while the deepened discriminator improves the model’s ability to learn texture details and color features. However, a high-performance transfer network alone is insufficient for high-quality data augmentation. For example, the None + Realdata group used direct whole-image style transfer for data augmentation. Although its results still improved compared to the original data group, all indicators lagged significantly behind the Mosaic + Realdata and Mixup + Realdata groups.
Therefore, on the other hand, this study adopted a multi-scale image style transfer method for data generation, which involves simultaneously transferring the styles of both whole images and tea shoot regions before restoration. Compared with existing methods, while Mosaic’s composite scenes generated by four-image stitching can improve model robustness, its random stitching may introduce irrelevant semantic noise; Mixup’s linear interpolation easily causes target edge blurring, especially exacerbating classification ambiguity in dense target scenarios. Examples of images generated by the three data augmentation methods are shown in Table 9.
To visually demonstrate the effect, two randomly selected test images were used as detection examples. Figure 11 compares the detection results of the different data augmentation methods. Red boxes indicate correct detections, while blue boxes indicate missed tea shoots. Group a is the ground truth with 29 tea shoots; group b (proposed method) missed 4; group c (Mosaic) missed 5; group d (Mixup) missed 6; group e (non-multi-scale style transfer) missed 8; group f (no augmentation) missed 9. The proposed method therefore uses the existing mature dataset (Zhongcha 108) to perform multi-scale style transfer and generate high-fidelity virtual data. Compared with other data augmentation methods, it improves detection on previously unseen tea data (Longjing 43) more effectively, demonstrating strong data augmentation capability.

3.4. t-SNE Visual Embedding Analysis for Domain Alignment

To quantitatively and visually validate the effectiveness of Tea CycleGAN in aligning feature distributions across domains (cultivar and day-night variations), t-SNE (t-distributed Stochastic Neighbor Embedding) visual embedding experiments were conducted. This analysis aimed to demonstrate how the proposed method narrows the distribution gap between source and target domains compared to baseline methods. The experiments focused on one domain pair: cross-cultivar and Day–Night. Features of all images (original and augmented) were extracted using the pre-trained Inception-v3 model (consistent with Section 2.6.3), resulting in 2048-dimensional feature vectors. t-SNE was configured with a perplexity of 30, early exaggeration of 12, learning rate of 200, and 1000 iterations, with comparisons made between two conditions: original images without augmentation and Tea CycleGAN.
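A minimal scikit-learn sketch of this embedding step, under the configuration listed above, is shown below; `features` stands for the stacked 2048-dimensional Inception-v3 vectors of the compared image groups, and only the KL divergence reported by the fitted model is returned (the per-sample kernel bandwidths behind the mean sigma metric are internal to the t-SNE optimization).

import numpy as np
from sklearn.manifold import TSNE

def embed_tsne(features, random_state=0):
    """2-D t-SNE embedding with perplexity 30, early exaggeration 12,
    learning rate 200, and 1000 iterations."""
    tsne = TSNE(n_components=2, perplexity=30, early_exaggeration=12,
                learning_rate=200, n_iter=1000,   # n_iter is named max_iter in newer scikit-learn
                init="pca", random_state=random_state)
    embedding = tsne.fit_transform(features)       # features: (N, 2048) Inception-v3 vectors
    return embedding, tsne.kl_divergence_          # final KL divergence of the embedding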
Two key metrics were used to evaluate t-SNE embedding quality and distribution alignment: mean sigma, reflecting the compactness of local neighborhoods (smaller values indicate more concentrated distributions); and KL divergence, quantifying the overall similarity between the embedded distribution and the original high-dimensional feature distribution (smaller values indicate better preservation of high-dimensional structure). The results, presented in Table 10, reveal significant improvements in distribution alignment achieved by Tea CycleGAN:
For cross-cultivar alignment (Real LJ43 vs. Real ZC108), Tea CycleGAN reduced mean sigma by 16.7% and KL divergence by 14.3% compared to the original domains, indicating tighter clustering and better preservation of high-dimensional structure. Within-domain comparisons (e.g., Real LJ43 vs. Fake LJ43) showed that augmented samples closely matched their real counterparts, with KL divergence values approaching those of intra-domain comparisons (e.g., 0.490285 vs. 0.410419 for LJ43). These results confirm that Tea CycleGAN effectively aligns feature distributions across cultivars while maintaining intra-domain consistency.
Corresponding visualizations (Figure 12) further illustrate these improvements, with Tea CycleGAN-augmented samples forming cohesive clusters adjacent to their target domains, unlike the dispersed clusters of original samples. The reduced mean sigma and KL divergence values validate that the integration of SKConv and restoration-paste strategies enhances cross-domain feature alignment, enabling the model to generate synthetic images that closely mimic the statistical properties of real agricultural data.

4. Discussion

The proposed multi-scale Tea CycleGAN framework demonstrates significant improvements in cross-domain style transfer for agricultural imaging, particularly in preserving both macro-context and micro-level tea shoot details. The integration of SKConv modules in the generator allows for the dynamic selection of optimal kernel sizes, effectively capturing hierarchical features such as leaf venation and shoot pubescence, which are critical for cultivar discrimination. This architectural innovation contributes to the FID of 42.26 ± 1.43 at the 600 × 600 scale, well below that of the original CycleGAN (75.38 ± 2.56), indicating enhanced structural fidelity in synthetic backgrounds. Meanwhile, the deepened discriminator with batch normalization layers improves sensitivity to subtle lighting variations, as evidenced by the FID of 26.75 ± 1.03 at the 64 × 64 shoot scale. These results highlight the importance of balanced multi-scale feature learning in complex agricultural scenes.
The restoration-paste strategy plays a pivotal role in maintaining spatial consistency between synthetic shoots and backgrounds. Unlike random pasting methods that introduce semantic noise, this approach preserves the original positional relationships of shoots, aligning with real-world distributions (Figure 9). The detection performance gains (mAP + 9.6%) validate that preserving such contextual information is crucial for improving model generalization, particularly in dense canopy scenarios where overlapping shoots are prevalent. This finding underscores the need for context-aware augmentation methods in agricultural computer vision, where background complexity often confounds traditional data augmentation techniques.
Comparative experiments with Mosaic and Mixup further emphasize the superiority of style transfer-based augmentation. While Mosaic enhances scene diversity through image stitching, it introduces irrelevant composite backgrounds that may mislead the model. Mixup, on the other hand, causes edge blurring and label ambiguity due to linear interpolation, which is particularly detrimental for small objects like tea shoots. Notably, self-supervised learning-enhanced GANs (e.g., SimCLR- or DINO-integrated frameworks) have shown promise in improving feature alignment via contrastive pretraining, leveraging unlabeled data to capture generic visual patterns. However, these methods often lack customization for agricultural scenarios, where “phenotype-illumination joint variation”—simultaneous changes in traits (e.g., shoot maturity) and lighting (e.g., day/night)—requires targeted handling. In contrast, Tea CycleGAN’s SKConv-driven multi-scale fusion and discriminator depth are explicitly designed to model this agricultural-specific variability, outperforming generic self-supervised GANs in preserving tea shoot phenotypic details (e.g., trichome density) under dynamic lighting. In contrast, Tea CycleGAN generates synthetic images with sharp edges and natural color transitions (Table 9), maintaining high-quality features essential for accurate detection. These results suggest that adversarial training combined with hierarchical style transfer is better suited for agricultural applications requiring fine-grained feature preservation.
Despite these advancements, the current framework faces limitations. The computational cost of dual-scale training (600 × 600 and 64 × 64) remains high, necessitating further optimization for real-time agricultural systems. Benchmark tests on our experimental setup (NVIDIA RTX 4060 GPU) revealed that Tea CycleGAN required 38.7 ± 1.2 h for full training (1000 epochs), with inference times of 127 ± 4 ms per 600 × 600 image (7.8 FPS) and 15 ± 1 ms per 64 × 64 shoot patch (66.7 FPS). GPU utilization peaked at 89% during generator updates due to SKNet’s multi-branch convolutions. While sufficient for offline dataset augmentation, these metrics indicate challenges for real-time field deployment, particularly on embedded platforms common in agricultural robotics. Additionally, while cross-cultivar and cross-daylight transfer is validated, the model’s robustness across seasons and varying weather conditions remains untested. Moreover, extending beyond RGB domains presents a promising avenue: Rana et al. [39] recently demonstrated that GAN-based multispectral synthesis (e.g., RGB-infrared weed images) can enhance agricultural detection robustness, highlighting the value of non-visible spectra in capturing plant physiological traits (e.g., chlorophyll content via red-edge bands). Tea CycleGAN’s modular architecture—particularly its scalable SKConv blocks and adaptive feature alignment—positions it well for such extension, with potential to model cross-spectral dependencies (e.g., visible-near-infrared correlations in tea leaves) while preserving phenotypic fidelity, thus addressing a critical gap in current agricultural multispectral synthesis. Future research should explore adaptive learning strategies to dynamically adjust scale-specific feature weights during training, reducing computational overhead while maintaining performance. Moreover, incorporating temporal features from multi-seasonal datasets could enhance model adaptability to environmental variations, addressing the challenge of dynamic agricultural scenarios.

5. Conclusions

This study proposed Tea CycleGAN, a multi-scale style transfer framework, to address cross-Day–Night and cross-cultivar challenges in tea shoot detection. Tea CycleGAN outperformed baseline models across both 600 × 600 and 64 × 64 scales. For 600 × 600 images, it achieved an FID score of 42.26 ± 1.43 (95% CI: [41.08, 43.44]) and MMD of 0.02241 ± 0.0015, representing a 43.9% lower FID than the original CycleGAN (75.38 ± 2.56) and a 63.65% lower MMD. For 64 × 64 localized tea shoot regions, its FID of 26.75 ± 1.03 (95% CI: [25.95, 27.55]) and MMD of 0.02452 ± 0.0011 showed a 53.5% and 13.91% improvement over CycleGAN, respectively. These results validate the effectiveness of integrating SKConv (for multi-scale feature fusion) and a deepened discriminator (for enhanced detail discrimination). The restoration-paste strategy preserved spatial consistency, enabling YOLOv7 to achieve an mAP of 83.54%, surpassing random pasting (mAP: 61.48%) and probability-based pasting (mAP: 69.19%). Relative to the unaugmented baseline (73.94%), the framework improved mAP by 9.6 percentage points, exceeding Mosaic (80.13%) and Mixup (78.93%). t-SNE analysis further confirmed better domain alignment, with 16.7% lower mean sigma and 14.3% lower KL divergence. Scientifically, this work fills critical gaps in agricultural GANs by integrating multi-scale feature fusion, local–global consistency preservation, and systematic cross-domain transfer. Practically, it provides a reliable solution for tea shoot detection under variable conditions, supporting automated harvesting and reducing labor costs. Future work will focus on lightweight optimization, cross-seasonal transfer, and multispectral extension to expand applicability in dynamic agricultural environments.

Author Contributions

Conceptualization, T.Y. and J.C.; methodology, T.Y., Z.G. and J.C.; software, J.J.; validation, Y.L.; formal analysis, T.Y.; investigation, T.Y. and Z.G.; resources, T.Y. and J.C.; data curation, T.Y., J.J., J.C., and Y.L.; writing—original draft preparation, C.Y.; writing—review and editing, T.Y. and J.C.; visualization, T.Y. and J.C.; supervision, J.C.; project administration, J.C.; funding acquisition, J.C., J.J., Y.L., and C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Natural Science Foundation of China (Grant Nos. U23A20175, 52305289, 32472009) and the earmarked fund for CARS (China Agriculture Research System)-19.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to our image datasets being self-built.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chan, S.P.; Yong, P.Z.; Sun, Y.; Mahendran, R.; Wong, J.C.M.; Qiu, C.; Ng, T.P.; Kua, E.H.; Feng, L. Associations of long-term tea consumption with depressive and anxiety symptoms in community-living elderly: Findings from the Diet and Healthy Aging Study. J. Prev. Alzheimer’s Dis. 2018, 5, 21–25. [Google Scholar] [CrossRef]
  2. Li, L.; Wen, M.; Hu, W.; Huang, X.; Li, W.; Han, Z.; Zhang, L. Non-volatile metabolite and in vitro bioactivity differences in green, white, and black teas. Food Chem. 2025, 477, 143580. [Google Scholar] [CrossRef]
  3. Wang, Y.; Li, L.; Liu, Y.; Cui, Q.; Ning, J.; Zhang, Z. Enhanced quality monitoring during black tea processing by the fusion of NIRS and computer vision. J. Food Eng. 2021, 304, 110599. [Google Scholar] [CrossRef]
  4. Lu, J.; Luo, H.; Yu, C.; Liang, X.; Huang, J.; Wu, H.; Wang, L.; Yang, C. Tea bud DG: A lightweight tea bud detection model based on dynamic detection head and adaptive loss function. Comput. Electron. Agric. 2024, 227, 109522. [Google Scholar] [CrossRef]
  5. Chen, C.; Lu, J.; Zhou, M.; Yi, J.; Liao, M.; Gao, Z. A YOLOv3-based computer vision system for identification of tea buds and the picking point. Comput. Electron. Agric. 2022, 198, 107116. [Google Scholar] [CrossRef]
  6. Zhang, L.; Zou, L.; Wu, C.; Jia, J.; Chen, J. Method of famous tea sprout identification and segmentation based on improved watershed algorithm. Comput. Electron. Agric. 2021, 184, 106108. [Google Scholar] [CrossRef]
  7. Yu, T.J.; Chen, J.N.; Chen, Z.W.; Li, Y.T.; Tong, J.H.; Du, X.Q. DMT: A model detecting multispecies of tea buds in multi-seasons. Int. J. Agric. Biol. Eng. 2024, 17, 199–208. [Google Scholar] [CrossRef]
  8. Li, Y.; He, L.; Jia, J.; Lv, J.; Chen, J.; Qiao, X.; Wu, C. In-field tea shoot detection and 3D localization using an RGB-D camera. Comput. Electron. Agric. 2021, 185, 106149. [Google Scholar] [CrossRef]
  9. Wang, X.; Wu, Z.; Fang, C. TeaPoseNet: A deep neural network for tea leaf pose recognition. Comput. Electron. Agric. 2024, 225, 109278. [Google Scholar] [CrossRef]
  10. Xu, W.; Zhao, L.; Li, J.; Shang, S.; Ding, X.; Wang, T. Detection and classification of tea buds based on deep learning. Comput. Electron. Agric. 2022, 192, 106547. [Google Scholar] [CrossRef]
  11. Nishad, P.; Chezian, R. Various colour spaces and colour space conversion algorithms. J. Glob. Res. Comput. Sci. 2013, 4, 44–48. [Google Scholar]
  12. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  13. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar] [CrossRef]
  14. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar] [CrossRef]
  15. Gui, Z.; Chen, J.; Li, Y.; Chen, Z.; Wu, C.; Dong, C. A lightweight tea bud detection model based on Yolov5. Comput. Electron. Agric. 2023, 205, 107636. [Google Scholar] [CrossRef]
  16. Wu, Y.; Chen, J.; Wu, S.; Li, H.; He, L.; Zhao, R.; Wu, C. An improved YOLOv7 network using RGB-D multi-modal feature fusion for tea shoots detection. Comput. Electron. Agric. 2024, 216, 108541. [Google Scholar] [CrossRef]
  17. Krichen, M. Generative Adversarial Networks. In Proceedings of the 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 6–8 July 2023; pp. 1–7. [Google Scholar] [CrossRef]
  18. Xiao, D.; Zeng, R.; Liu, Y.; Huang, Y.; Liu, J.; Feng, J.; Zhang, X. Citrus greening disease recognition algorithm based on classification network using TRL-GAN. Comput. Electron. Agric. 2022, 200, 107206. [Google Scholar] [CrossRef]
  19. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef]
  20. Espejo-Garcia, B.; Mylonas, N.; Athanasakos, L.; Vali, E.; Fountas, S. Combining generative adversarial networks and agricultural transfer learning for weeds identification. Biosyst. Eng. 2021, 204, 79–89. [Google Scholar] [CrossRef]
  21. Yang, X.; Guo, M.; Lyu, Q.; Ma, M. Detection and classification of damaged wheat kernels based on progressive neural architecture search. Biosyst. Eng. 2021, 208, 176–185. [Google Scholar] [CrossRef]
  22. Cang, H.; Yan, T.; Duan, L.; Yan, J.; Zhang, Y.; Tan, F.; Lv, X.; Gao, P. Jujube quality grading using a generative adversarial network with an imbalanced data set. Biosyst. Eng. 2023, 236, 224–237. [Google Scholar] [CrossRef]
  23. Cap, Q.H.; Uga, H.; Kagiwada, S.; Iyatomi, H. LeafGAN: An Effective Data Augmentation Method for Practical Plant Disease Diagnosis. IEEE Trans. Autom. Sci. Eng. 2022, 19, 1258–1267. [Google Scholar] [CrossRef]
  24. Li, X.; Li, X.; Zhang, M.; Dong, Q.; Zhang, G.; Wang, Z.; Wei, P. SugarcaneGAN: A novel dataset generating approach for sugarcane leaf diseases based on lightweight hybrid CNN-Transformer network. Comput. Electron. Agric. 2024, 219, 108762. [Google Scholar] [CrossRef]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  26. Zeng, S.; Zhang, H.; Chen, Y.; Sheng, Z.; Kang, Z.; Li, H. Swgan: A new algorithm of adhesive rice image segmentation based on improved generative adversarial networks. Comput. Electron. Agric. 2023, 213, 108226. [Google Scholar] [CrossRef]
  27. Madsen, S.L.; Dyrmann, M.; Jørgensen, R.N.; Karstoft, H. Generating artificial images of plant seedlings using generative adversarial networks. Biosyst. Eng. 2019, 187, 147–159. [Google Scholar] [CrossRef]
  28. Egusquiza, I.; Benito-Del-Valle, L.; Picón, A.; Bereciartua-Pérez, A.; Gómez-Zamanillo, L.; Elola, A.; Aramendi, E.; Espejo, R.; Eggers, T.; Klukas, C.; et al. When synthetic plants get sick: Disease graded image datasets by novel regression-conditional diffusion models. Comput. Electron. Agric. 2025, 229, 109690. [Google Scholar] [CrossRef]
  29. Raya-González, L.E.; Alcántar-Camarena, V.A.; Saldaña-Robles, A.; Duque-Vazquez, E.F.; Tapia-Tinoco, G.; Saldaña-Robles, N. High-precision prototype for garlic apex reorientation based on artificial intelligence models. Comput. Electron. Agric. 2025, 235, 110375. [Google Scholar] [CrossRef]
  30. Abbas, A.; Jain, S.; Gour, M.; Vankudothu, S. Tomato plant disease detection using transfer learning with C-GAN synthetic images. Comput. Electron. Agric. 2021, 187, 106279. [Google Scholar] [CrossRef]
  31. Lacerda, C.F.; Ampatzidis, Y.; Costa Neto, A.d.O.; Partel, V. Cost-efficient high-resolution monitoring for specialty crops using AgI-GAN and AI-driven analytics. Comput. Electron. Agric. 2025, 237 Pt B, 110678. [Google Scholar] [CrossRef]
  32. Krestenitis, M.; Ioannidis, K.; Vrochidis, S.; Kompatsiaris, I. Visual to near-infrared image translation for precision agriculture operations using GANs and aerial images. Comput. Electron. Agric. 2025, 237 Pt C, 110720. [Google Scholar] [CrossRef]
  33. Afzal Maken, F.; Muthu, S.; Nguyen, C.; Sun, C.; Tong, J.; Wang, S.; Tsuchida, R.; Howard, D.; Dunstall, S.; Petersson, L. Improving 3D Reconstruction Through RGB-D Sensor Noise Modeling. Sensors 2025, 25, 950. [Google Scholar] [CrossRef]
  34. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 510–519. [Google Scholar] [CrossRef]
  35. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  36. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 2017, arXiv:1706.08500. [Google Scholar] [CrossRef]
  37. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar] [CrossRef]
  38. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875. [Google Scholar] [PubMed]
  39. Rana, S.; Gatti, M. Comparative evaluation of modified Wasserstein GAN-GP and state-of-the-art GAN models for synthesizing agricultural weed images in RGB and infrared domain. MethodsX 2025, 14, 103309. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (a) Longjing 43 under daylight conditions, (b) Zhongcha 108 under supplementary night lighting.
Figure 2. Localized shoot images demonstrating cultivar-specific morphological characteristics under different lighting regimes.
Figure 3. Schematic representation of the Tea CycleGAN architecture with bidirectional domain translation.
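For readers unfamiliar with the bidirectional translation summarized in Figure 3, Tea CycleGAN builds on the standard CycleGAN objective of Zhu et al. [19], which couples two adversarial losses (one per translation direction) with a cycle-consistency term. The formulation below restates that baseline objective only; the cycle weight λ = 10 is the value used in [19], and the loss weighting actually adopted for Tea CycleGAN may differ.

\begin{aligned}
\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) &= \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big] \\
\mathcal{L}_{\mathrm{cyc}}(G, F) &= \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big] \\
\mathcal{L}(G, F, D_X, D_Y) &= \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X) + \lambda\,\mathcal{L}_{\mathrm{cyc}}(G, F)
\end{aligned}

Here X and Y denote the two domains being translated (e.g., Longjing 43 and Zhongcha 108 imagery), G: X → Y and F: Y → X are the two generators, and D_X, D_Y are the corresponding discriminators.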
Figure 4. Selective Kernel Convolution (SKConv) unit architecture.
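As a companion to Figure 4, the following is a minimal PyTorch sketch of a two-branch selective-kernel unit in the spirit of Li et al. [34]: split into branches with different receptive fields, fuse via global pooling, and select with a softmax attention over branches. The branch kernels, reduction ratio, and layer names are illustrative assumptions, not the exact configuration used in the Tea CycleGAN generator.

# Minimal two-branch SKConv sketch (assumed configuration, for illustration only).
import torch
import torch.nn as nn


class SKConv(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Split: two branches with different receptive fields (3x3 and dilated 3x3 ~ 5x5).
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Fuse: global context compressed to a compact descriptor.
        d = max(channels // reduction, 8)
        self.fc = nn.Sequential(nn.Linear(channels, d), nn.ReLU(inplace=True))
        # Select: per-branch channel attention logits.
        self.attn3 = nn.Linear(d, channels)
        self.attn5 = nn.Linear(d, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))              # global average pooling -> [B, C]
        z = self.fc(s)
        a = torch.stack([self.attn3(z), self.attn5(z)], dim=1).softmax(dim=1)
        a3, a5 = a[:, 0, :, None, None], a[:, 1, :, None, None]
        return a3 * u3 + a5 * u5                    # attention-weighted branch fusion


if __name__ == "__main__":
    y = SKConv(64)(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 64, 32, 32])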
Figure 5. Residual block incorporating SKConv for enhanced feature learning.
Figure 6. Improved generator architecture with SKConv-enhanced residual blocks.
Figure 7. Architecture of the enhanced discriminator with deep convolutional layers.
Figure 8. Framework of the multi-scale style transfer data augmentation method. Red boxes mark the labeled “tea shoot” regions in the dataset.
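To make the data flow in Figure 8 concrete, the sketch below outlines the restoration-paste step: the whole 600 × 600 scene is translated with the global generator, each labeled 64 × 64 shoot crop is translated with the local generator, and the translated shoots are pasted back at their original labeled coordinates so the existing bounding-box annotations remain spatially consistent. The function name and the generator interface are hypothetical placeholders, not the authors' code.

# Illustrative restoration-paste sketch; `g_global` and `g_local` are assumed to be
# trained translators with a numpy-in/numpy-out interface (placeholders, not the paper's API).
import numpy as np

def restoration_paste(image: np.ndarray, boxes, g_global, g_local) -> np.ndarray:
    """image: HxWx3 source-domain image; boxes: list of (x1, y1, x2, y2) shoot labels."""
    # 1. Translate the whole scene so the background matches the target domain.
    out = g_global(image).copy()
    # 2. Translate each labeled shoot crop at its native (local) scale.
    for (x1, y1, x2, y2) in boxes:
        crop = image[y1:y2, x1:x2]
        fake = g_local(crop)
        # 3. Paste the translated shoot back at the original coordinates so the
        #    existing bounding-box annotations stay valid.
        out[y1:y2, x1:x2] = fake
    return out

if __name__ == "__main__":
    # Identity "translators" stand in for the generators to show the data flow only.
    img = np.zeros((600, 600, 3), dtype=np.uint8)
    labels = [(100, 120, 164, 184)]  # one 64x64 shoot box
    aug = restoration_paste(img, labels, g_global=lambda x: x, g_local=lambda x: x)
    print(aug.shape)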
Figure 9. Training loss curves.
Figure 10. Parameter analysis of the Longjing 43 dataset.
Figure 11. Comparison of detection results (red boxes indicate correct detections; blue boxes indicate undetected tea shoots). (a) Ground truth, containing 29 tea shoots in total; (b) detection result of the proposed method, with 4 undetected tea shoots; (c) detection result of the Mosaic augmentation method, with 5 undetected tea shoots; (d) detection result of the Mixup augmentation method, with 6 undetected tea shoots; (e) detection result of the non-multi-scale style transfer augmentation method, with 8 undetected tea shoots; (f) detection result without data augmentation, with 9 undetected tea shoots.
Figure 12. t-SNE visualization.
Table 1. Dataset Composition and Partitioning.
Cultivar | Total Quantity | Detection Network Training Dataset | Detection Network Test Dataset | Generative Adversarial Network Training Dataset | Generative Adversarial Network Test Dataset
LJ43 | 1000 | 500 | 100 | 300 | 100
ZC108 | 1000 | 500 | 100 | 300 | 100
Table 2. FID and MMD Comparison of Different GAN Architectures for 600 × 600 Images.
Model | FID (Mean ± SD) | 95% CI for FID | MMD (Mean ± SD) | 95% CI for MMD
Tea CycleGAN | 42.26 ± 1.43 | [41.08, 43.44] | 0.02241 ± 0.0015 | [0.0212, 0.0236]
CycleGAN + SKConv | 47.32 ± 1.69 | [45.98, 48.66] | 0.03263 ± 0.0021 | [0.0309, 0.0343]
CycleGAN + Improved discriminator | 53.57 ± 2.01 | [51.95, 55.19] | 0.05942 ± 0.0028 | [0.0572, 0.0616]
CycleGAN | 75.38 ± 2.56 | [73.35, 77.41] | 0.06137 ± 0.0032 | [0.0588, 0.0639]
DCGAN | 131.98 ± 3.32 | [129.35, 134.61] | 0.07086 ± 0.0035 | [0.0680, 0.0737]
WGAN | 254.76 ± 4.53 | [251.01, 258.51] | 0.10245 ± 0.0048 | [0.0985, 0.1064]
Table 3. FID and MMD Comparison of Different GAN Architectures for 64 × 64 Images.
Model | FID (Mean ± SD) | 95% CI for FID | MMD (Mean ± SD) | 95% CI for MMD
Tea CycleGAN | 26.75 ± 1.03 | [25.95, 27.55] | 0.02452 ± 0.0011 | [0.0237, 0.0253]
CycleGAN + SKConv | 32.79 ± 1.26 | [31.82, 33.76] | 0.02529 ± 0.0012 | [0.0244, 0.0262]
CycleGAN + Improved discriminator | 44.23 ± 1.61 | [42.98, 45.48] | 0.02538 ± 0.0012 | [0.0245, 0.0262]
CycleGAN | 57.51 ± 1.94 | [55.98, 59.04] | 0.02845 ± 0.0014 | [0.0274, 0.0295]
DCGAN | 67.32 ± 2.21 | [65.59, 69.05] | 0.04767 ± 0.0024 | [0.0458, 0.0495]
WGAN | 82.25 ± 2.47 | [80.32, 84.18] | 0.04851 ± 0.0026 | [0.0465, 0.0505]
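For reference, the two quality metrics reported in Tables 2 and 3 are defined as follows (lower is better for both). The FID of [36] compares Gaussian fits to Inception features of real (r) and generated (g) images, while the MMD is shown in its generic kernel form; the specific kernel behind the reported MMD values is not restated here.

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big)

\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\big[k(x, x')\big] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\big[k(x, y)\big] + \mathbb{E}_{y, y' \sim Q}\big[k(y, y')\big]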
Table 4. 600 × 600 Image Generation Comparison.
Model | Generated Image
Real Zhongcha 108 | Agriculture 15 01739 i001
Real Longjing 43 | Agriculture 15 01739 i002
Tea CycleGAN | Agriculture 15 01739 i003
CycleGAN + SKConv | Agriculture 15 01739 i004
CycleGAN + Improved discriminator | Agriculture 15 01739 i005
CycleGAN | Agriculture 15 01739 i006
DCGAN | Agriculture 15 01739 i007
WGAN | Agriculture 15 01739 i008
Table 5. 64 × 64 Image Generation Comparison.
Model | Generated Image
Real Zhongcha 108 | Agriculture 15 01739 i009
Real Longjing 43 | Agriculture 15 01739 i010
Tea CycleGAN | Agriculture 15 01739 i011
CycleGAN + SKConv | Agriculture 15 01739 i012
CycleGAN + Improved discriminator | Agriculture 15 01739 i013
CycleGAN | Agriculture 15 01739 i014
DCGAN | Agriculture 15 01739 i015
WGAN | Agriculture 15 01739 i016
Table 6. Comparison of data placement methods’ impact on object detection performance.
Placement Method | Restoration-Paste | Probability-Based | Random
mAP (%) | 83.54 | 69.19 | 61.48
P (%) | 86.03 | 66.69 | 51.59
Recall (%) | 55.18 | 43.16 | 40.49
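For clarity, the detection metrics in Tables 6 and 8 follow the standard definitions below, with true positives (TP), false positives (FP), and false negatives (FN) counted at a fixed IoU threshold (commonly 0.5; the exact threshold is not restated here). Since only the “tea shoot” class is detected, mAP reduces to the AP of that single class.

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad \mathrm{AP} = \int_0^1 P(R)\,\mathrm{d}R, \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i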
Table 7. Comparison of images generated by different data placement methods.
Placement Method | Restoration-Paste | Probability-Based | Random
Generated Image | Agriculture 15 01739 i017 | Agriculture 15 01739 i018 | Agriculture 15 01739 i019
Table 8. Comparison of data augmentation methods’ impact on object detection performance.
Data Augmentation Method | mAP (%) | P (%) | Recall (%)
Ours + real data | 83.54 | 86.03 | 55.18
Mosaic + real data | 80.13 | 83.52 | 50.92
Mixup + real data | 78.93 | 81.49 | 48.74
None + real data | 75.21 | 79.12 | 46.20
Original data | 73.94 | 75.25 | 45.25
Table 9. Comparison of images generated by different data augmentation methods.
Data Augmentation Method | Generated Image
Ours | Agriculture 15 01739 i020
Mosaic | Agriculture 15 01739 i021
Mixup | Agriculture 15 01739 i022
None | Agriculture 15 01739 i023
Original data | Agriculture 15 01739 i024
Table 10. t-SNE Embedding Metrics for Domain Alignment.
Embedding Pair | Mean Sigma | KL Divergence
Real LJ43 to real ZC108 | 3.406924 | 0.621378
Fake LJ43 to fake ZC108 | 2.837938 | 0.532295
Real LJ43 to real LJ43 | 2.291254 | 0.410419
Real LJ43 to fake LJ43 | 2.755302 | 0.490285
Real ZC108 to real ZC108 | 2.412863 | 0.345618
Real ZC108 to fake ZC108 | 2.613834 | 0.389483
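The sketch below shows one way such alignment figures can be obtained (cf. Figure 12 and Table 10): embed two feature sets jointly with t-SNE, fit a Gaussian to each embedded cloud, and compute the closed-form Gaussian KL divergence between them. This is an illustrative reading of the metric under stated assumptions; the authors’ exact definitions of “Mean Sigma” and of the reported KL divergence may differ.

# Illustrative domain-alignment sketch in t-SNE space (assumed metric definition).
import numpy as np
from sklearn.manifold import TSNE

def gaussian_kl(a: np.ndarray, b: np.ndarray) -> float:
    """KL( N(mu_a, S_a) || N(mu_b, S_b) ) fitted to two embedded point clouds."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    sa, sb = np.cov(a.T), np.cov(b.T)
    sb_inv = np.linalg.inv(sb)
    k = a.shape[1]
    diff = mu_b - mu_a
    return 0.5 * (np.trace(sb_inv @ sa) + diff @ sb_inv @ diff - k
                  + np.log(np.linalg.det(sb) / np.linalg.det(sa)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats_a = rng.normal(0.0, 1.0, (200, 128))   # e.g., features of real LJ43 images
    feats_b = rng.normal(0.3, 1.0, (200, 128))   # e.g., features of generated (fake) LJ43 images
    # Embed both clouds jointly so they share one 2-D t-SNE space.
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
        np.vstack([feats_a, feats_b]))
    print(f"Gaussian KL in t-SNE space: {gaussian_kl(emb[:200], emb[200:]):.4f}")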
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
