1. Introduction
The rapid advancement of technology has enabled the widespread adoption of data processing techniques based on artificial intelligence (AI), with deep learning (DL) emerging as one of the most prominent approaches across both scientific and professional communities. DL has proven particularly effective for pattern analysis and recognition in imagery-based data. Within the realm of supervised learning, DL techniques have been extensively applied to automate tasks such as classification and object detection, offering significant contributions to various domains including industry [1], medicine [2], and agriculture [3]—fields in which the availability of high-quality, timely annotated datasets plays a critical role.
In viticulture, for instance, the timely detection of agro-environmental conditions that influence pest and disease outbreaks is essential for informed decision-making and the proper implementation of targeted interventions in vineyard management. To support this, specialized IoT platforms such as mySense [4,5] have proven valuable in monitoring a range of crop-influencing factors. In integration with mySense, VineInspector [6] demonstrates how DL-powered visual approaches can extract actionable insights by detecting and classifying events and anomalies in vineyards. These include monitoring vine shoot size in relation to primary downy mildew infections and identifying the presence of Lobesia botrana—the European grape moth—in pheromone traps.
However, the development of effective visual inference models hinges on the availability of large, high-quality annotated datasets—whose acquisition is often constrained by the natural seasonality of viticulture [7]. More specifically, data collection for training models to detect specific events may only be feasible during narrow windows of the phenological cycle which, if missed, can significantly delay the training process and, consequently, the deployment of operational systems. For example, in a study focused on yield estimation and wine-making potential based on grape counting [8], data acquisition and annotation campaigns spanned two years, targeting the window from July to September, during which grapes develop and mature until harvest. Another noteworthy example refers to ampelography [9], where data acquisition was concentrated over a few months, corresponding to the phenological stage during which grapevines develop their canopy foliage; nevertheless, the process, labeling included, took two years to complete. In another related study, Kontogiannis et al. [10] collected and annotated 6800 images from ground stations and drones during the growing season to support the development of a downy mildew detection solution. Likewise, works involving trap-based grape moth detection [6] reveal the significant logistical effort required to develop such solutions—including trap preparation, waiting periods to capture the insects of interest, and subsequent imagery retrieval and annotation procedures.
To address the limitations associated with dataset scarcity and the significant effort required to develop deployable image-based inference solutions, this work draws inspiration from prior research (e.g., [9,11]) to explore synthetic data generation as a strategy for developing initial yet functional DL models. By creating synthetic, domain-compliant training datasets, the proposed approach reduces early reliance on time-consuming procedures, from image acquisition to annotation, thereby potentially accelerating the deployment of inference models that should nonetheless be iteratively refined over time, for example, through active learning techniques. Two complementary synthetic image generation methods are proposed: (i) classical digital image processing techniques and (ii) advanced text-to-image diffusion models. These methods are applied and tested in two relevant viticulture-related use cases: the detection of Lobesia botrana, a major grapevine pest, and the identification of common but potentially destructive vine diseases such as black rot, leaf blight, and ESCA. In addition to presenting the generated datasets, model performances, and operational insights, this paper discusses each proposed synthetic data generation approach, outlining its advantages and limitations and positioning it within the broader literature.
The remainder of this paper is structured as follows: Section 2 reviews related work; Section 3 describes the proposed synthetic data generation strategies; Section 4 demonstrates their application through two case studies; Section 5 presents the experimental results; Section 6 offers a critical analysis and discussion; and Section 7 concludes with directions for future research.
2. Related Work
At the core of reliable supervised machine learning (ML) and DL decision-support models—whether applied in healthcare [2], infrastructure engineering [12], the automotive industry [1], the textile sector [13], footwear businesses [14], or inclusive technologies [15]—lies the availability of well-structured, validated, and problem-representative datasets. This requirement is equally critical in agriculture [16], where the need to capture the idiosyncrasies of crops and plants under naturally variable environmental conditions increases complexity.
In agricultural and viticultural contexts, data collection is particularly challenging. It is often restricted by seasonal cycles and specific phenological stages, which limit the opportunity to gather large-scale, annotated datasets. Despite the growing availability of repository systems that encourage cross-domain public dataset sharing—including in agriculture [3,17,18]—the scarcity of labeled training data remains one of the most critical bottlenecks for deploying deep learning models in real-world agricultural settings [3]. Building ML/DL models that can effectively operate in such environments requires not only advanced architectures but also a comprehensive understanding of the domain problem at hand, allowing for the design of representative datasets either by undertaking data acquisition initiatives from scratch or by building upon existing repositories [16].
IoT platforms such as mySense [4] have made strides in this area by enabling crowd-sourced contributions and expert validation for agricultural data aimed at supporting ML/DL-based decision making. Notably, mySense supports intelligent node integration, as demonstrated in [6] through the VineInspector system. This system tracks shoot growth to monitor infection risks such as downy mildew, while also observing pheromone traps for the European grapevine moth to assess pest presence and activity. Also focusing on edge computing strategies, Gonçalves et al. [19] developed a dataset comprising 168 images with 8966 insect instances that had to be manually annotated. This dataset was then used to train and benchmark five models—e.g., SSD MobileNetV2 and EfficientDet-D0—with SSD ResNet50 achieving the best class-wise accuracies and F1-scores. Regarding inference times, the minimum observed was greater than 19 s per image on high-end smartphones. Concerned with the main vectors of Flavescence dorée—Scaphoideus titanus and Orientus ishidae—Checola et al. [20] used sticky traps to retain the target insects, enabling the construction of a dataset comprising 600 images and approximately 1500 annotations per class. YOLOv8 and Faster R-CNN were employed to support deep learning-based detections, with the former achieving superior performance in terms of mean average precision at 0.5 overlap (mAP@0.5) and F1-score.
Regarding the automatic detection and identification of grapevine leaf diseases, Xie et al. [21] developed the Grape Leaf Disease Dataset (GLDD), comprising 4449 original images across four classes—black rot, black measles, leaf blight, and mites. They also proposed a deep learning model named Faster DR-IACNN, which integrates components from established architectures such as Inception-ResNet-v2 for advanced feature extraction, achieving a competitive mAP. In another work with similar goals, Hasan et al. [22] developed and assessed their own convolutional neural network (CNN) model, reporting an accuracy of 91.37% for distinguishing black rot, ESCA, leaf blight, and healthy leaves from a public dataset [23] (built based on PlantVillage [17]). Matching the same disease classes, Prasad et al. [24] modified a VGG-16 CNN and tested it on a public and relatively balanced dataset composed of 9027 images [25]. In [26], the same set of diseases was considered through the partial integration of the PlantVillage dataset [17]. In total, 14 CNN and 17 vision transformer models were assessed and compared, with 4 of them achieving the highest accuracies. In another work, an Xception model fine-tuned by Nasra and Gupta [27] attained high accuracy over a public grapevine diseases dataset [28] with characteristics similar to those previously addressed. More recently, Talaat et al. [29] proposed a plant disease detection algorithm composed of several modules, including feature extraction and CNN-based classification, combined with a grid search strategy for hyperparameter optimization. Considering another PlantVillage-inspired [17] dataset [30], high accuracy was observed in tests.
These works are essentially grounded in supervised learning, where labeled data is essential. However, manual annotation is often time-consuming and labor-intensive [3]. Alternatives like self-supervised learning—as in the case of generative adversarial networks (GANs)—reduce reliance on labeled data but are still constrained by the quality and biases of the original datasets [31]. In complement, semi-supervised learning consists of (i) combining labeled and unlabeled data to learn more robust features and thereby improve model generalization capabilities, and/or (ii) querying users (preferably experts) to label specific data, leading to sustainable active learning [3]. While uncertainty remains a factor to account for when generalizing with such approaches, strategies involving human-in-the-loop participation can be quite beneficial [32], as they facilitate continuous model improvement. In turn, faster model deployment becomes viable, particularly in scenarios where sub-optimal performance is acceptable in initial setups.
To tackle the challenges posed by data scarcity, the generally burdensome and time-consuming nature of data collection and annotation, and the time-frame limitations imposed by nature-driven events (as in agricultural/viticultural contexts), the generation of synthetic datasets is becoming an increasingly explored strategy among both scientific and professional communities, with notable potential to support the rapid development of initial, yet deployable, deep learning models. For instance, Kubric [33] is a general-purpose engine capable of generating photorealistic scenes by combining various technologies and repositories containing prefabricated 3D assets while maintaining links to individual scene components to facilitate automated annotation. However, content generation appears to be either random or dependent on significant scripting skills and coding effort—especially in scenarios where object specificity and spatial distribution are critical—to build such virtual environments. Kubric also demands substantial computational resources. In another approach [34], an automated framework was proposed to generate CAD-based data with accurate labels for large-scale robotic bin-picking datasets, with applications in industrial settings and tracked conditions. More focused on natural environments, the PROMORE framework [11] uses procedural 3D modeling to produce synthetic images for object detection and segmentation in remote sensing. Prior to this framework, a preliminary system for virtual wildfire generation within a 3D environment [35]—constructed using a photogrammetric process based on remotely piloted aerial systems (RPASs)—was developed and tested using neural networks specialized in segmentation. The results showed promising indications of the potential of synthetic data to train models capable of operating in real-world scenarios. In another study, Adão et al. [9] explored grapevine variety classification through single-leaf image analysis. Building on Xception-based models and anticipating broader field applications, they created synthetic imagery by embedding leaves from an existing dataset into vineyard-like backgrounds, enabling segmentation and classification work-flows. Other strategies relying on the combination of large language models (LLMs) with visual foundation models have opened new possibilities for generating synthetic datasets directly from text descriptions [36]. These technologies can be employed for rapid dataset creation for training custom inference models, even in highly specific agricultural domains.
Inspired by some of the above approaches and motivated by the persistent challenges in agricultural DL deployment, this paper explores synthetic dataset generation methods to accelerate the operationalization of DL models in precision viticulture. The proposed strategies are presented in the following sections.
3. Proposed Dataset-Related Methods for Expeditious DL Modeling
This section outlines the foundational principles of the proposed methods for guiding the generation of synthetic datasets through image-based coarse problem approximation, with the aim of bootstrapping DL models and circumventing the need for explicit data collection, annotation, and organization from scratch.
3.1. Dataset Design Approaches Overview
The image processing work-flow illustrated in Figure 1 describes a procedure in which background imagery is combined with foreground elements—referred to as elements of interest (EOIs). Specifically, for each image to be generated, a background is randomly selected from a local archive and subjected to a series of random transformations to introduce data diversity, including scaling within a given range, brightness and contrast adjustments within predefined bounds, and vertical flipping. Subsequently, EOIs are selected through a stochastic process, resized within a predefined range—based on real-world measurements—and subjected to further transformations such as random rotation within a given angular range, brightness and contrast adjustments, and vertical or horizontal flips. EOIs are iteratively placed onto the background until either a predefined limit is reached or a maximum number of placement attempts is exceeded—governed by an exclusion criterion that considers overlap tolerance between new EOIs and those already placed. The resulting composite images and corresponding annotations are exported, ensuring compatibility with a well-known object detection architecture—You Only Look Once (YOLO).
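To make the work-flow concrete, the following minimal sketch illustrates the composition and automatic annotation steps using Pillow; the directory layout, parameter ranges, and helper names are illustrative assumptions rather than the exact implementation used in this work.

```python
# Minimal sketch of the composition work-flow (assumed paths and ranges).
import random
from pathlib import Path
from PIL import Image, ImageEnhance

BACKGROUNDS = list(Path("backgrounds").glob("*.jpg"))  # local background archive (assumed)
EOIS = list(Path("eois").glob("*.png"))                # contour-cropped EOIs with alpha channel

def jitter(img, lo=0.8, hi=1.2):
    """Apply random brightness and contrast adjustments within [lo, hi]."""
    img = ImageEnhance.Brightness(img).enhance(random.uniform(lo, hi))
    return ImageEnhance.Contrast(img).enhance(random.uniform(lo, hi))

def overlap_exceeds(box, placed, tol=0.2):
    """Exclusion criterion: reject a candidate box whose IoU with any placed box exceeds tol."""
    x1, y1, x2, y2 = box
    for a1, b1, a2, b2 in placed:
        iw = max(0, min(x2, a2) - max(x1, a1))
        ih = max(0, min(y2, b2) - max(y1, b1))
        inter = iw * ih
        union = (x2 - x1) * (y2 - y1) + (a2 - a1) * (b2 - b1) - inter
        if union > 0 and inter / union > tol:
            return True
    return False

def compose(max_eois=15, max_attempts=50):
    """Compose one synthetic image and its YOLO-format annotation lines."""
    bg = Image.open(random.choice(BACKGROUNDS)).convert("RGB")
    if random.random() < 0.5:
        bg = bg.transpose(Image.FLIP_TOP_BOTTOM)  # random vertical flip
    bg = jitter(bg)
    placed, labels, attempts = [], [], 0
    while len(placed) < max_eois and attempts < max_attempts:
        attempts += 1
        eoi = Image.open(random.choice(EOIS)).convert("RGBA")
        scale = random.uniform(0.05, 0.12)  # EOI width relative to background (assumed range)
        w = max(1, int(bg.width * scale))
        h = max(1, int(eoi.height * w / eoi.width))
        eoi = eoi.resize((w, h)).rotate(random.uniform(0, 360), expand=True)
        r, g, b, a = eoi.split()  # jitter only the color bands, preserving transparency
        eoi = Image.merge("RGBA", (*jitter(Image.merge("RGB", (r, g, b))).split(), a))
        if eoi.width >= bg.width or eoi.height >= bg.height:
            continue
        x = random.randint(0, bg.width - eoi.width)
        y = random.randint(0, bg.height - eoi.height)
        box = (x, y, x + eoi.width, y + eoi.height)
        if overlap_exceeds(box, placed):
            continue  # placement attempt rejected by the exclusion criterion
        bg.paste(eoi, (x, y), eoi)  # alpha-composite the EOI onto the background
        placed.append(box)
        # YOLO annotation: class x_center y_center width height, all normalized to [0, 1]
        labels.append(
            f"0 {(x + eoi.width / 2) / bg.width:.6f} {(y + eoi.height / 2) / bg.height:.6f} "
            f"{eoi.width / bg.width:.6f} {eoi.height / bg.height:.6f}"
        )
    return bg, labels
```

Because the placement coordinates are known algorithmically, each bounding box can be written out exactly, which is what removes the manual annotation burden in this approach.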
The integration of large language models (LLMs), such as ChatGPT [37], with diffusion models (e.g., DALL·E [38]) represents one of the most significant technological advancements of recent years, enabling AI to perform generative tasks—including the autonomous creation of realistic imagery—grounded in a semantic understanding of, for example, user-provided textual input. Such integration is also explored in the present work to accelerate and facilitate synthetic data generation. As illustrated in Figure 2, the user defines a set of generation requirements, which can be iteratively reviewed and refined as needed. Once finalized, a request for the bulk generation of elements—such as images and corresponding annotations—is issued, and the resulting outputs can be downloaded upon completion of the generation process.
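For illustration, the bulk-generation step in Figure 2 could also be scripted against a text-to-image API rather than the interactive chat interface. The sketch below uses the publicly documented OpenAI Python client; the model choice, prompt wording, batch size, and output layout are assumptions for this example, not the exact setup used in this work.

```python
# Hypothetical bulk generation of background imagery via a text-to-image API.
import base64
from pathlib import Path
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()
PROMPT = (
    "Top-down photo of an empty delta-shaped pheromone trap with a white sticky "
    "liner, placed in a vineyard, natural daylight, photorealistic"
)

out_dir = Path("synthetic_backgrounds")
out_dir.mkdir(exist_ok=True)
for i in range(20):  # arbitrary batch size for this sketch
    response = client.images.generate(
        model="dall-e-3", prompt=PROMPT, size="1024x1024", response_format="b64_json"
    )
    image_bytes = base64.b64decode(response.data[0].b64_json)
    (out_dir / f"trap_{i:03d}.png").write_bytes(image_bytes)
```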
3.2. Collaborative Inference Design
A strategy for achieving time-effective DL modeling while bypassing steps such as in-field data acquisition and organization may rely on combining available annotated imagery sources with automation pipelines that simulate not-yet-established datasets. Figure 3 proposes a flowchart encompassing such a strategy.
For example, if a dataset cataloging a specific phenomenon (e.g., grapevine diseases) is already available in public repositories, it can be directly used to develop a single-instance classification model. However, for this model to function effectively in natural environments with complex visual compositions (e.g., a full grapevine canopy), a filtering approach is required to first identify and isolate EOIs. One viable way to enable such filtering is through the creation of a synthetic, object detection-oriented dataset designed to reflect real operational conditions. Such a dataset can be achieved by strategically positioning contour-based, isolated EOIs onto realistic background images, emulating the relevant settings that depict the target problem, while also generating corresponding annotations automatically—leveraging the algorithmic control over element placement. In turn, the resulting synthetic images and their corresponding annotations can be used to train a preliminary object detection model capable of recognizing and isolating general instances. During operation, this object detection model first identifies EOIs, which are then cropped and passed to a classifier trained on the original, phenomenon-driven dataset, enabling the assignment of more specific labels (e.g., healthy, disease A, disease B, …, disease N). This synergistic strategy combines the strengths of both approaches: the availability of labeled datasets and the efficiency of automated, representative synthetic data generation.
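As a sketch of this detector-to-classifier hand-off, the snippet below chains a YOLO-based detector with a Keras Xception classifier; the weight files, class names, and preprocessing details are hypothetical placeholders, assuming both models were trained as described above.

```python
# Sketch of the two-stage EOI detection and classification pipeline.
import numpy as np
from PIL import Image
from ultralytics import YOLO  # pip install ultralytics
from tensorflow.keras.models import load_model
from tensorflow.keras.applications.xception import preprocess_input

detector = YOLO("eoi_detector.pt")              # trained on the synthetic dataset (assumed file)
classifier = load_model("phenomenon_cls.h5")    # trained on the phenomenon-driven dataset
CLASSES = ["healthy", "disease_A", "disease_B"] # assumed label ordering

def infer(image_path):
    """Stage 1: locate EOIs; stage 2: crop each EOI and assign a specific label."""
    image = Image.open(image_path).convert("RGB")
    detections = detector(image)[0]
    outputs = []
    for box in detections.boxes.xyxy.cpu().numpy():
        x1, y1, x2, y2 = map(int, box[:4])
        crop = image.crop((x1, y1, x2, y2)).resize((299, 299))  # Xception input size
        batch = preprocess_input(np.asarray(crop, dtype="float32"))[None]
        probs = classifier.predict(batch, verbose=0)[0]
        outputs.append(((x1, y1, x2, y2), CLASSES[int(probs.argmax())], float(probs.max())))
    return outputs
```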
Once the desired models have been trained, they can collaborate by being integrated into a sequential pipeline, as illustrated in Figure 4. In this process, an input image is processed by an initial model, which distils relevant information to be interpreted by a subsequent model. This flow enables a modular and layered approach, where different models contribute distinct processing capabilities, progressively enriching the understanding derived from the original input.
The following section demonstrates the application of the proposed dataset generation methods through two case studies: grape moth detection and vine leaf-based disease recognition.
6. Discussion
This work proposed and explored a set of methods based on synthetic data generation, complemented by existing datasets, with the goal of enabling the rapid development of preliminary yet functional DL models suitable for field deployment. These methods involve customized image generation through LLM and diffusion models, domain-relevant composition strategies, and automatic annotation, leading to datasets compliant with modern DL tools for model training. The case studies selected to demonstrate the viability of these methods lie within the scope of precision viticulture. Initially, an approach was developed for modeling datasets oriented to the challenge of detecting grape moths in pheromone traps. Generative AI was used to produce empty traps, while publicly available datasets [39,40] allowed insect samples to be extracted directly from the images. Classic image processing techniques were applied to compose synthetic images, wherein contour-based cropped grape moth specimens were placed over AI-generated trap backgrounds. The resulting YOLO-compatible dataset enabled the training of a YOLOv8s model—selected for its efficiency in edge computing environments—that attained precision values close to 0.99 and an mAP of approximately 0.86. Despite achieving a performance that seems comparable to those reported in the literature for similar contexts—e.g., [6,19]—it should be noted that, due to the use of such a synthetic dataset for training, this grape moth detection model may lack environmental variability and, as such, be susceptible to blind spots when shifts in operational conditions occur (e.g., changes in lighting, background variation, or the presence of different insect species).
As for disease identification, the combination of available datasets with synthetic data to rapidly expand DL applicability was also explored. An Xception model was trained following the directives of a previous study [9] and using a publicly available dataset from Kaggle [25], which consists of a set of annotated images depicting healthy and infected leaves acquired under laboratory conditions. On similar datasets, the model achieved high accuracy, comparable to or even outperforming results reported in the related literature [22,24,27,29]. These results also demonstrate the viability and flexibility of the training process proposed in [9], which was initially developed for digital ampelography but has proven effective for grapevine disease identification based on visual symptoms in the leaves. However, due to the limitations of the utilized dataset in representing field conditions—where leaves are embedded within complex canopies—a leaf detection model was required to bridge the gap between laboratory data and real-world deployment. An initial attempt to generate such a dataset relied solely on ChatGPT's integration with DALL·E. While the tool produced visually convincing images, the annotations were found to be unreliable or entirely random, making the approach unfeasible for direct dataset construction. As an alternative, ChatGPT/DALL·E was used to separately generate leafless grapevine backgrounds and isolated grapevine leaves with various visual characteristics (e.g., healthy, chlorotic, or spot-affected). Some key operational considerations emerged during this process:
Before generation, requirements must be specified in detail and confirmed by ChatGPT through summarization, to mitigate potential issues related to linguistic ambiguity, model misinterpretation, or even drift;
During batch generation, interruptions often occurred, requiring manual prompting to resume the process.
Once both background and foreground libraries were prepared, traditional image processing techniques—similar to those used for the grape moth case—were employed to synthesize new images and generate precise annotations. The resulting YOLOv8s leaf detector achieved strong results on unseen synthetic data (precision and mAP above 0.97), although performance declined when tested on external, real-world imagery (originally acquired at the Nossa Senhora de Lourdes farm), revealing some limitations in generalization. Despite such limitations, the integrated use of the YOLOv8s-based leaf detector and the Xception-based disease classifier within a unified pipeline yielded promising outcomes. In external imagery, although not all leaves were detected, the inference system was still able to correctly classify diseased leaves among those successfully extracted, supporting the feasibility of the proposed approach.
Overall, the proposed methods offer a promising pathway for the rapid deployment of first-generation AI models, as evidenced by the tests conducted in viticultural scenarios. When integrated into an active learning framework—such as the one illustrated in Figure 15, inspired by works like [32]—these models can function not only as early inference tools but also as key enablers of a continuous improvement strategy. High-confidence predictions may directly support decision-making in the field, while low-confidence outputs can be flagged for expert review and incorporated into subsequent training cycles. This iterative process enables continuous refinement and adaptation of the models, leading to progressively improved performance and closer alignment with ideal real-world operating conditions.
One final note concerns the methodological positioning of this work within the broader literature on automatic approaches for generating credible datasets and annotations. While many existing studies—particularly in agriculture [47]—employ generative AI to enhance datasets by creating new instances from existing training data with the aim of improving DL model performance, the approaches proposed in this paper are fundamentally different. Specifically, foundational datasets are synthetically created from scratch and roughly modeled to reflect the general conditions of operational environments. The goal is to enable the initial deployment of DL models in a time-effective manner, allowing them to be incrementally improved over time through, for example, active learning strategies [32], rather than relying on the continuous expansion of existing datasets. Notwithstanding, while using ChatGPT/DALL·E to create fully labeled datasets remains an ongoing challenge—since these tools lack introspective mechanisms for determining the positioning and spatial occupation of elements of interest (EOIs), as observed in Section 4—other approaches (e.g., [11,33]) have explored alternative strategies grounded in the assembly of realistic 3D scenes with automatic annotation capabilities. Complementarily, the approach presented in this paper combines the believable 2D generation capabilities of natural language-driven models, as in the former, with the parameterization and control typical of the latter, aiming to provide context-aware synthetic imagery that is both quick and relatively simple to produce, accompanied by accurate annotations. As all of these approaches are still evolving to shape usable AI-oriented synthetic data sourcing strategies, their consolidation will naturally motivate future comparative analyses aimed at better understanding the strengths and limitations of each proposal.
The next section concludes this paper by summarizing the main contributions and outlining directions for future work.
7. Conclusions and Future Work
The development of reliable visual inference models is often hindered by the burdensome and time-consuming nature of collecting and annotating high-quality datasets. To address this challenge, this work explored the use of synthetic dataset generation to bootstrap DL models, with proof-of-concept applications in the context of precision viticulture.
Two case studies were examined. In the first, a fully synthetic dataset was generated to train a YOLOv8-based object detection model for identifying grape moths in pheromone trap imagery. In the second, a hybrid inference pipeline was proposed, combining a YOLOv8-based model for leaf detection in grapevine canopy images with an Xception-based classifier for disease identification. This two-stage approach enabled the classification of leaves into four categories—healthy, black rot, ESCA, and leaf blight—highlighting the synergistic potential of integrating object detection and classification tasks within a modular architecture.
Results from both scenarios demonstrate that fully or partially synthetic visual data pipelines can provide a valid foundation for the initial deployment of inference models, particularly in domains where annotated data is scarce. Although these models are not flawless, they showed promising performance when evaluated on synthetic imagery unseen during training, as well as apparently good operational capabilities under relevant conditions, based on visual assessment of detection and classification outputs. Additional inference results from collaborative models for grapevine leaf detection followed by disease recognition also provided reasonable indications of suitability for initial DL deployment. Nevertheless, it is important to emphasize the need for continuous improvement processes—potentially driven by active learning strategies—to enhance model robustness over time. In addition to the presented use cases, the proposed synthetic data generation approaches could also benefit other applications in—though not limited to—agriculture, particularly in visual contexts that share similar meta-requirements regarding scene composition. Examples include the detection of animals posing threats to crops (e.g., birds), anti-theft surveillance, drone-based cattle detection and counting, and agricultural fire monitoring, among others.
Looking ahead, future work will focus on deploying these models in edge computing environments to enable real-time, in-field operation. Additionally, active learning frameworks will be integrated, allowing the models to flag uncertain predictions for expert review. This feedback loop will support continuous, data-driven model refinement, progressively improving accuracy and adaptability under real-world conditions.