1. Introduction
Crop diseases that affect grapevines, such as downy mildew and powdery mildew, are highly damaging when infection becomes severe. Both the quantity and the quality of the grapes are reduced, leading to yield and financial losses for farmers. Disease development is driven by various fungal or fungal-like pathogens and by pest infestation. Fungicides are widely used to control these pathogens, but their extensive use has led to environmental degradation and biodiversity loss and has raised health concerns for both farmers and consumers. The workflow proposed here for the early prediction of these diseases would therefore lead to increased crop yield, better grape and wine quality, and reduced costs for farmers.
Downy mildew is a severe disease caused by the fungus-like (oomycete) pathogen Plasmopara viticola. Its symptoms appear on both the leaves and the vines. Once established on a plant, Plasmopara viticola spreads rapidly through secondary infections, producing a characteristic white, downy sporulation on the underside of the leaves. On the upper surface, oil-spot lesions appear that can evolve into white growth on the underside and later turn necrotic [1]. If left uncontrolled, the disease leads to premature defoliation, weakens the vines, and causes significant reductions in grape yield [2].
Historically, downy mildew has devastated vineyards, particularly in Europe. For example, it caused a 70% reduction in French grape production in 1915 and significant losses in Germany and Italy during the early 20th century [
1,
3]. The disease not only reduces yields but also affects the organoleptic characteristics of wine, affecting its appearance, aroma, and taste.
Powdery mildew is caused by the pathogen Erysiphe necator and differs from downy mildew in that it requires lower moisture levels to infect grapevines. An obvious symptom of this disease is a white powdery cover on the leaves that can affect all parts of the vine, including the fruit. Because this white powder covers a major part of the leaves, it reduces photosynthesis and exposes berries to sunburn through defoliation [4,5].
The significant impact of powdery mildew is observed in juice quality: that is, the combination of sugar accumulation, the intensity of the red color, browning, and acidity [
6]. Powdery mildew appears to affect sugar accumulation in grapevines in a manner similar to other chronic stress factors, such as drought or defoliation [
7]. Failing to control powdery mildew may lead to decreased Brix levels (even lower than the levels accepted by processors) and also chronic reductions in grapevine growth.
Control measures are implemented to mitigate the symptoms of the disease, but the economic losses remain substantial. The disease reduces yields and also lowers fruit quality. For instance, a 50% disease incidence in vineyards in South India resulted in substantial losses [
8]. Pool et al. [
9] recorded a 40% reduction in the vine size and a 65% reduction in the yield of Rossetee [
10] because of powdery mildew.
A grapevine disease prediction model would fail if it did not take environmental conditions into consideration because they are crucial in enabling the pathogens that cause the diseases [
11,
12]. Specifically, downy mildew needs a moderate environment (about 15–23 °C) and high humidity (over 80%). Also, rainfall plays a critical role in downy mildew outbreaks by providing the necessary moisture for spore germination and dispersal [
4]. On the other hand, powdery mildew needs warmer conditions (about 17–28 °C) and is not as dependent on moisture levels for infection to occur. Temperatures above 40 °C stop its development, and the relative humidity should be above 45% [
10,
13,
14].
Thus, considering that specific environmental conditions, such as temperature and humidity levels, affect the rate of plant disease infection, this work proposes a pipeline driven by weather and environmental parameters to predict the presence of these pathogens in grapevines and to detect the diseases at an early stage. The importance of an early grapevine disease prediction system is further highlighted by the fact that disease symptoms are not clearly visible in the early stages of infection, when the plant must be carefully inspected by the farmer.
Quite a few studies have explored crop disease prediction, with a particular focus on grapevine diseases, utilizing environmental Internet-of-Things (IoT) data. Various machine learning (ML) approaches have been employed, including Hidden Markov Models (HMMs) [
15] or decision trees and neural networks [
16], while others followed a rule-based approach [
17]. Convolutional neural networks (CNNs) and deep learning techniques have also been applied in image-based analyses for grapevine disease recognition [
18,
19]. Additionally, a publicly available dataset [
20], comprising three features over five months, has been utilized in hybrid rule-based predictions for downy mildew and powdery mildew diseases. In contrast, our dataset spans over four years and includes annotations for both diseases, providing more extensive temporal coverage and detailed labeling. A recent study similar to ours [
21], by Zhao and Efremova, combined multispectral imagery and environmental parameters to predict grapevine diseases at the block level. Their approach incorporated numerous features, which may have increased the complexity of the prediction task. Consequently, the TabPFN transformer did not significantly outperform models like XGBoost or CatBoost in their study. In contrast, our work leverages a curated, multiyear dataset of IoT sensor measurements, focusing on critical environmental features that facilitate the development of downy mildew and powdery mildew. We integrate these features with labels indicating whether fungicide treatments were applied, aiming to enhance predictive accuracy while maintaining feature simplicity.
Our objective is to perform early assessments of grapevine disease risks using monitored IoT environmental data and the corresponding treatment labels. By employing the TabPFN transformer, which delivers outputs in under one second, our workflow supports real-time applications in precision viticulture. This approach allows for proactive interventions before the manifestation of visible symptoms, enhancing overall grapevine health, yields, and fruit quality while concurrently reducing fungicide applications, agricultural input costs, and environmental impacts.
From a machine learning perspective, the early prediction and detection of grapevine diseases involve training models to recognize environmental patterns associated with disease development. These patterns encompass specific conditions that facilitate pathogen proliferation and the subsequent infection of grapevines. Traditionally, diseases are identified only after visible symptoms manifest on the crop. However, our objective is to anticipate the onset of the disease before such symptoms appear. Using IoT sensors to monitor environmental parameters and feeding these data into machine learning classifiers, we can predict disease risks in a timely manner. This proactive approach allows the assessment of the risks of pathogen infestation and the implementation of treatments before outbreaks occur or as the disease begins to progress. The benefits of this method are evident in the overall health of plants, leading to significant improvements in both the quantity and quality of grapevine yields. In addition, early disease prediction enables the application of precise treatment amounts, thereby reducing costs for farmers. The contribution of our work can be summarized as follows:
We make our self-curated and disease-annotated IoT environmental data publicly available, spanning from 2020 to May 2024, facilitating further research in this domain.
We perform a comparative analysis of different tabular data classifiers.
We demonstrate the efficacy of the TabPFN transformer on IoT environmental tabular data, highlighting its advantages over other predictors.
We present a workflow capable of early disease prediction and operable under real-time conditions due to the rapid inference capabilities of the TabPFN transformer.
2. Materials and Methods
This study aims to develop an early prediction system for grapevine diseases based on environmental conditions, thereby enabling timely interventions to prevent disease progression. To achieve this, we collected environmental data, specifically temperature, humidity, and rainfall accumulation, using IoT sensors deployed in vineyard settings. These parameters were processed into a feature vector comprising seven elements: mean, minimum, and maximum values for both temperature and humidity, along with cumulative rainfall measurements.
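As an illustration of this step, the following sketch builds one seven-element feature vector per day; it assumes sub-daily raw readings with hypothetical column names, and if the station already exports daily aggregates, it reduces to selecting the corresponding columns.

```python
import pandas as pd

def daily_feature_vectors(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw sensor readings into one 7-element feature vector per day.

    Expects columns 'timestamp', 'temperature', 'humidity', 'rainfall'
    (illustrative names; adapt to the actual sensor export).
    """
    df = df.copy()
    df["date"] = pd.to_datetime(df["timestamp"]).dt.date
    daily = df.groupby("date").agg(
        temp_mean=("temperature", "mean"),
        temp_min=("temperature", "min"),
        temp_max=("temperature", "max"),
        hum_mean=("humidity", "mean"),
        hum_min=("humidity", "min"),
        hum_max=("humidity", "max"),
        rain_sum=("rainfall", "sum"),   # cumulative rainfall per day
    )
    return daily.reset_index()
```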
These feature vectors served as inputs for ML classifiers designed to assess the risk of disease outbreaks. The models were trained using binary labels indicating the presence or absence of disease, inferred from records of fungicide applications by farmers and agronomists as preventive or curative measures.
Given the challenges associated with real-world datasets, such as limited sample sizes and class imbalances, data augmentation and balancing techniques were employed to enhance the model’s robustness. We evaluated multiple classification algorithms to determine the most effective model for predicting disease risk.
The dataset encompassed environmental data collected from 2020 to 2024, which, along with the augmented data, were used for training and testing the ML models. Data from 2023 and 2024 were reserved for testing and validation purposes. This approach aligns with contemporary practices in precision viticulture, where integrating IoT sensor data with ML algorithms facilitates proactive disease management strategies.
The overall suggested workflow and the steps followed to perform grapevine disease prediction are shown in
Figure 1.
2.1. Data Description
2.1.1. Grapevine Field Description
This study was conducted in a 7-hectare experimental vineyard located in Tebano (Faenza, RA), Emilia-Romagna, Italy. The vineyard is situated on flat terrain characterized by a clayey loam soil. It is cultivated under integrated crop management, combining sustainable agronomic practices to optimize vine health and productivity. The vineyard hosts multiple grapevine cultivars, including Sangiovese, Trebbiano, Pignoletto, Albana, Lambrusco Salamino, Lambrusco Sorbara, and Ancellotta, all grafted onto Kober 5BB rootstocks. The vines were planted in 2018 and trained using the Guyot system, which is suited to low-to-moderate-vigor vineyards. The planting layout consists of 2.7 m between rows and 1 m between vines within the same row, and the vineyard is oriented southwest. Irrigation was managed through a drip irrigation system with a flow rate of 4 L/h and drippers spaced every 50 cm.
2.1.2. IoT Environmental Data Acquisition and Labeling
The environmental monitoring system employed in this study was designed to collect and structure a multi-year dataset spanning from 2020 to 2024. Data acquisition relied on IoT-enabled weather stations installed in the vineyards to ensure the continuous and consistent monitoring of key parameters, including air temperature, relative humidity, and rainwater accumulation. These variables were recorded daily, providing a comprehensive overview of vineyard microclimatic conditions over multiple growing seasons.
Data collection was carried out using an IFARMING weather station (
https://ifarming.srl/products/?lang=en accessed on 21 February 2025), which featured a thermo-hygrometer sensor housed in a passive solar radiation shield for accurate air temperature and humidity measurements. A tipping-bucket rain gauge was also used to measure rainfall accumulation. The system was powered by a 1 W solar panel, with a data transmitter ensuring seamless data transfer via an internal or external antenna, depending on signal quality.
Data collected by these weather stations were transmitted and stored on dedicated IoT platforms. The IFARMING platform (
https://ifarming.srl/platform/?lang=en accessed on 21 February 2025) was used for real-time visualization, historical analysis, and integration with decision support systems. All collected data, including meteorological, soil, and vineyard management records, were systematically processed and structured into standardized Excel files. The meteorological dataset was categorized based on attributes such as the vineyard structural reference, company name, field identification, crop variety, and measurement date. The environmental variables were further refined through statistical aggregation, including the average, minimum, maximum, sum, deviation, mode, and median, allowing for improved data interpretation and integration across multiple years.
Alongside environmental monitoring, vineyard treatments were meticulously documented to establish a correlation between disease management interventions and the preceding climatic conditions. For each application, the records included the date, type of product, dosage, applied volume, and target pathogen. Treatments were explicitly annotated in the meteorological dataset, with the environmental data from up to five days prior also included. These treatment annotations were used as the labels in our experiments.
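For illustration, a minimal sketch of how such labels could be derived from the treatment records, with hypothetical column names; each treatment day and up to five preceding days are marked as high risk.

```python
import pandas as pd

def label_from_treatments(daily: pd.DataFrame, treatment_dates,
                          days_prior: int = 5) -> pd.DataFrame:
    """Attach binary labels to the daily feature rows.

    A day is labeled 1 (high disease risk) if a fungicide treatment was applied
    on that day or within the following `days_prior` days, i.e., each treatment
    annotation is propagated back to up to five days of preceding data.
    """
    daily = daily.copy()
    daily["date"] = pd.to_datetime(daily["date"])
    daily["label"] = 0
    for t in pd.to_datetime(list(treatment_dates)):
        window = (daily["date"] >= t - pd.Timedelta(days=days_prior)) & (daily["date"] <= t)
        daily.loc[window, "label"] = 1
    return daily
```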
2.2. Data Augmentation and Preprocessing
One of the main reasons that ML models underperform is data scarcity, together with the nature of the real data used to train a model. Real data are often difficult to capture or record in quantities sufficient for model convergence, and they may under-represent parts of the task, favoring some cases over others. A way to increase the amount of data in a qualitative manner, addressing data scarcity and improving the model's generalization ability, is to generate synthetic data that mimic the properties of the original data; thus, we performed synthetic data generation using a Gaussian copula and additive Gaussian noise. A problem often encountered in tabular datasets is that of missing (NaN) values. However, since the provided data contained one measurement per day, along with the maximum and minimum of the respective feature, this was not an issue in our case, and no preprocessing for missing values took place.
2.2.1. Gaussian-Copula-Based Synthetic Data Generation
Inspired by studies that worked on the generation of synthetic data for an ML emulator using environmental data [
22] or those that suggested a copula-based framework for the generation of synthetic data [
23], we leveraged the efficiency of the simple statistical methods (multivariate CDF) of the Gaussian copula [
24] to generate new synthetic data, since the amount of initial data was too small for the model to be trained on it properly. A Gaussian copula models the marginal distributions of the features and captures the correlations among them by estimating their joint cumulative distribution function (joint CDF). It is constructed from a multivariate normal distribution with correlation matrix P, and the inverse of the copula's construction steps is used to generate pseudo-random samples from general classes of multivariate probability distributions. In our case (tabular data of environmental measurements), copulas and their inverse transformation methods are very useful for producing a synthetic dataset that mimics the real data in terms of the associations among its features [
25].
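The following is a minimal sketch of this construction using empirical marginals; an off-the-shelf copula implementation could be used instead, and the function shown here is purely illustrative.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(X: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw synthetic rows that mimic the feature associations of X.

    Standard Gaussian-copula steps:
      1. map each feature to uniform ranks, then to standard-normal scores;
      2. estimate the correlation matrix of these scores;
      3. sample from the fitted multivariate normal;
      4. map back through the empirical marginal quantiles.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. probability-integral transform via empirical ranks
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))
    # 2.-3. fit the correlation matrix and sample latent Gaussian vectors
    corr = np.corrcoef(z, rowvar=False)
    latent = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    # 4. back-transform through each feature's empirical quantile function
    u_new = stats.norm.cdf(latent)
    return np.column_stack([np.quantile(X[:, j], u_new[:, j]) for j in range(d)])
```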
2.2.2. Additive-Gaussian-Noise-Based Synthetic Data Generation
In addition to augmenting the data with a Gaussian copula, we also applied additive Gaussian noise. The generation of synthetic data using additive Gaussian noise has already been explored for environmental data [26]. For each feature column, the standard deviation of the noise was set to 0.1 times the standard deviation of the original feature, as this value produces new samples that are close to the original ones while retaining a sufficient degree of variance.
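A minimal sketch of this augmentation step (function and variable names are illustrative):

```python
import numpy as np

def augment_with_gaussian_noise(X: np.ndarray, y: np.ndarray,
                                scale: float = 0.1, seed: int = 0):
    """Create one noisy copy of each training row.

    For every feature column, the noise standard deviation is `scale` (0.1 here)
    times that column's standard deviation, so synthetic samples stay close to
    the originals while adding some variance. Labels are copied unchanged.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, scale * X.std(axis=0), size=X.shape)
    return np.vstack([X, X + noise]), np.concatenate([y, y])
```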
2.2.3. Data Normalization (Standardization)
We performed normalization, specifically standardization, a common method in which each feature is transformed so that it has a mean of zero and a standard deviation of one. The mean and standard deviation were computed from the training data. The standardization process is described by the following formula:

$$z = \frac{x - \mu}{\sigma},$$

where $z$ is the standardized value of $x$ (the value of the feature to be normalized), $\mu$ is the mean of the training data, and $\sigma$ is the standard deviation of the training data.
When features vary widely in scale (for example, a tabular dataset containing age in years alongside income in USD), algorithms that use distance metrics (such as k-nearest neighbors or support vector machines) or gradient-based optimization (like neural networks and linear regression with gradient descent, as in our case) can be biased toward features with larger magnitudes. By normalizing the input features, we bring them onto a comparable scale, ensuring that each feature influences the outcome to a similar degree, so normalization is an essential step before feeding the data to a model.
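A minimal sketch of this step with scikit-learn, assuming X_train and X_test hold the training and test feature matrices; the scaler is fit on the training split only.

```python
from sklearn.preprocessing import StandardScaler

# Fit on the training split only, then apply the same mean/standard
# deviation to the test split (z = (x - mu) / sigma).
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
```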
2.2.4. Data Balancing
In real-world applications, imbalanced datasets are common, where one class significantly outnumbers the other(s). This imbalance is also evident in our dataset, which is expected, since pathogens typically proliferate when environmental conditions are favorable or when no treatment is applied. The SMOTE [
27] algorithm was chosen for oversampling as a method to form a more balanced dataset.
We first split the data into 80% for training and 20% for testing and then performed normalization and oversampling using only the training data. Before oversampling, we had 1300 training data points, of which 988 belonged to class 0 (no disease) and 312 to class 1 (disease). After oversampling, the ratio of the minority class to the majority class was 80%, raising the amount of training data to 1778, of which 790 belonged to class 1, while the number of class 0 samples remained the same.
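A sketch of the balancing step with imbalanced-learn, assuming the standardized training split from the previous step; the random seed is an illustrative choice.

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority (disease) class on the training split only, up to
# 80% of the majority class size: the 988 "no disease" rows stay as they are,
# and the minority class is raised to roughly 790 real plus synthetic rows.
smote = SMOTE(sampling_strategy=0.8, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_std, y_train)
```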
2.3. Grapevine Disease Prediction Using ML
We formulated the pathogen prediction task as a binary classification problem on tabular data, aiming to determine whether there is a high risk of powdery mildew or downy mildew development. To evaluate model performance, we employed various classifiers, including logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), random forest, gradient boosting, XGBoost, CatBoost, and the recently introduced transformer-based TabPFN classifier. The following sections briefly outline how each model was configured and trained. Given that the employed machine learning algorithms, except for TabPFN, are well established, we focus our methodological exposition on TabPFN, a state-of-the-art model for tabular data prediction.
2.3.1. Logistic Regression
Logistic regression was used as a simple and fast baseline ML method. It is well suited for our problem as a statistical method that solves binary problems. As a solver for estimating the parameters of logistic regression, the LBFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) algorithm was used [
28], which is an efficient optimization algorithm that uses limited memory resources.
2.3.2. KNN
The k-nearest neighbors algorithm is one of the simplest ML classifiers. The training examples of each class are stored, each new test example is compared to them in terms of distance, and the class assigned to it is that of the majority among its k closest neighbors. We set the number of neighbors to 3 and used the Euclidean distance.
2.3.3. Support Vector Machine (SVM)
Since we have a binary classification task, SVM was also chosen as a supervised algorithm that maximizes the margin between the closest points of the two classes. The kernel of the classifier was the radial basis function (RBF), and its gamma parameter was set to the inverse of the product of the number of features and the variance of the data.
2.3.4. Random Forest
A way to obtain stronger predictive results and achieve robustness is to use ensemble methods like random forest. The creation of a multitude of decision trees during training often leads to stronger classification results. Experimentally, we found that 10 is the optimum maximum tree depth for our data.
2.3.5. Gradient Boosting
Gradient boosting, also an ensemble method, combines several weak learners into a strong learner: each new model is trained to reduce the errors of the previous ensemble by minimizing a loss function, such as the mean squared error or cross-entropy, via gradient descent. We set the number of estimators (that is, the number of trees in the ensemble) to 100 and the maximum depth of each tree to 8.
2.3.6. XGBoost
XGBoost [
29] is a computationally optimized algorithm that is powerful for structured or tabular data like ours. XGBoost is an extension of gradient boosting, and it was selected here for its very good performance in a wide range of classification tasks. We set its maximum tree depth to 5.
2.3.7. CatBoost
CatBoost [
30], as an ensemble of decision trees, was chosen for several reasons. By using ordered target statistics and ordered boosting, it avoids target leakage and is therefore less prone to overfitting. On tabular data, it performs similarly to or even better than XGBoost, and since it is a tree-based method, it can provide some level of explainability; this is useful in real applications where a farmer may ask for clarification about the model's output.
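For reference, a sketch that gathers the classifier configurations stated in Sections 2.3.1 to 2.3.7; parameters not mentioned in the text (such as random seeds or probability estimation for the SVM) are illustrative choices, and the balanced training split from Section 2.2.4 is assumed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Hyperparameters as described in the text; everything else is left at its
# default value or set only for reproducibility.
models = {
    "logistic_regression": LogisticRegression(solver="lbfgs", max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "svm": SVC(kernel="rbf", gamma="scale", probability=True),
    "random_forest": RandomForestClassifier(max_depth=10, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, max_depth=8),
    "xgboost": XGBClassifier(max_depth=5, eval_metric="logloss"),
    "catboost": CatBoostClassifier(verbose=0),
}

for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
```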
2.3.8. TabPFN Transformer
The TabPFN transformer [31] is a recently proposed tabular classification algorithm based on prior-data fitted networks [32], which showed that a transformer [33] can approximate Bayesian inference. TabPFN brings a radically new view of how tabular classification is carried out. It provides fast and accurate results because it is trained with a meta-learning approach on synthetic datasets generated from a carefully chosen prior over tabular data distributions, which is why the architecture performs well across diverse tabular prediction tasks with minimal or even no hyperparameter tuning.
The TabPFN transformer leverages, in a slightly different way, the core module of the transformer architecture [
33]. The self-attention mechanism computes attention weights that capture the relationships among input features, enabling the model to contextualize each element based on its relevance to others. Three different linear projections are computed from the given input $X$: a set of queries $Q = XW_Q$, keys $K = XW_K$, and values $V = XW_V$. Based on these, the self-attention mechanism is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimensionality of the key vectors $K$. The authors of [
31] utilize the standard self-attention mechanism; however, their approach differs in how attention computations are handled between the training and testing phases. They make use of two modules: one that computes self-attention among training data and another that computes cross-attention from testing examples to training ones.
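To make the attention computation concrete, below is a small NumPy sketch of the scaled dot-product attention defined above, applied once among (projected) training rows and once from test rows to training rows; Q_train, K_train, V_train, and Q_test stand for already-projected representations and are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Self-attention among training rows: queries, keys, and values all come
# from the projected training set.
train_ctx = scaled_dot_product_attention(Q_train, K_train, V_train)
# Cross-attention from test rows to training rows: queries come from the
# test set, keys and values from the training set.
test_ctx = scaled_dot_product_attention(Q_test, K_train, V_train)
```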
The TabPFN transformer distinguishes itself through its unique inference mechanism. During inference, the model receives both the training dataset and the test samples as inputs but does not undergo additional training on the provided data. Instead, it leverages the combined information to approximate the posterior predictive distribution (PPD), denoted $p(y_{\mathrm{test}} \mid x_{\mathrm{test}}, D_{\mathrm{train}})$, in a single forward pass. This approach enables the model to predict the output of the test samples by effectively contextualizing them within the training data. TabPFN is (pre-)trained on more than one million synthetic datasets generated from a carefully designed prior distribution that simulates realistic tabular data patterns through causal reasoning principles. This training enables the model to approximate Bayesian inference for new datasets and to perform real-time predictions while keeping the weights learned during the prior-fitting phase fixed.
Moreover, the TabPFN transformer omits positional encoding, which becomes redundant when the input is not sequential, and instead applies zero-padding and linear scaling to the features so that the model can accept inputs of variable length. In addition, the input X is preprocessed to handle missing values and heterogeneous feature types, which influences the subsequent projections of Q, K, and V.
Apart from its very good performance, the TabPFN transformer also has the advantage of yielding predictions in less than a second; this is important in terms of sustainability since it allows for affordable and green ML and also in terms of time complexity, making it a good choice for a real-time system.
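A minimal usage sketch with the tabpfn Python package follows; constructor arguments may differ across package versions, and the variable names reuse the earlier sketches.

```python
from tabpfn import TabPFNClassifier

# TabPFN is not retrained on our data: fit() only stores the training context,
# and a single forward pass at prediction time approximates the PPD.
clf = TabPFNClassifier(device="cpu")
clf.fit(X_train_bal, y_train_bal)
risk_proba = clf.predict_proba(X_test_std)[:, 1]   # P(high disease risk)
y_pred = (risk_proba >= 0.5).astype(int)
```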
2.4. Evaluation Metrics
Accuracy, which gives the proportion of total predictions that were correct, is the typical metric in classification tasks. In addition to it, we also use the ROC-AUC score, precision, recall, and the F1 score as evaluation metrics in order to gain a better view of the model's behavior and performance.
The ROC-AUC score computes the area under the curve (AUC) of the receiver operating characteristic (ROC). The ROC curve plots sensitivity (true positive rate) against the false positive rate (1 − specificity). The AUC measures the entire area under the ROC curve and is less influenced by changes in class distribution than accuracy. We used its weighted version, which computes the metric for each label and averages the results, weighted by the number of true instances of each label.
Precision and recall are useful since their high values indicate a few false positives and a few false negatives, respectively. However, both the number of false positives and false negatives should be considered, and the F1 score integrates them as a harmonic mean of precision and recall.
We measure precision, recall, and F1 score evaluation metrics for both classes separately to see how the model behaves in each case; we are more interested in the minority class: that is, the case with a high risk of a grapevine disease actually existing.
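A sketch of how these metrics can be computed with scikit-learn, reusing the predictions and probabilities from the previous sketch:

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_recall_fscore_support)

# Accuracy and weighted ROC-AUC, plus per-class precision/recall/F1 so that
# the minority "Yes" (disease risk) class can be inspected separately.
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, risk_proba, average="weighted")
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, labels=[0, 1])
print(f"accuracy={acc:.3f}  ROC-AUC={auc:.3f}")
print(f'class "No":  precision={prec[0]:.3f} recall={rec[0]:.3f} F1={f1[0]:.3f}')
print(f'class "Yes": precision={prec[1]:.3f} recall={rec[1]:.3f} F1={f1[1]:.3f}')
```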
3. Results
As mentioned above, we experimented with different ML prediction models on our collected and curated IoT tabular data. Some of the models were chosen as baseline methods, like logistic regression and KNN, which are traditionally used for tabular data prediction, and others were chosen for their proven very good performance, such as XGBoost [
29], CatBoost [
30], and also the state-of-the-art TabPFN transformer [
31]. The comparison results of the different ML models for predicting the risk of the infection of downy mildew and powdery mildew are shown in
Table 1 and
Table 2, respectively. The test set consisted of 326 seven-element feature vectors of the three environmental parameters (the minimum, maximum, and mean values of temperature and humidity, plus the daily rain accumulation): that is, measurements and fungicide treatment records corresponding to 326 days.
Considering the results, logistic regression, SVM, and KNN performed rather poorly, although they remained useful as baselines. The problem also proved difficult for the ensemble and boosting methods (random forest, gradient boosting, XGBoost, and CatBoost): their accuracy appeared good, but a closer look at the "Yes" subcolumn (disease present) of the precision, recall, and F1 score shows that this accuracy stemmed mainly from their learned ability to recognize the absence of disease rather than its presence, which is what we actually want to detect. The task likely became harder for these classifiers because additional high-risk days preceding the actual fungicide treatments were labeled as positive. Our expectations were met by the TabPFN transformer, which was the only classifier with high precision and recall for the "Yes" class, the class indicating a high risk of disease presence or development.
4. Discussion
Our aim in this work was to propose a robust, real-time system for predicting downy mildew and powdery mildew and to determine which ML classifier performs best for this task. Taking into account the results of
Table 1 and
Table 2, the TabPFN transformer proved that it can be used as a predictor in a real-time system.
Our IoT sensors monitored three environmental features: temperature, relative humidity, and rainwater accumulation. However, additional environmental parameters play an important role in enabling pathogens to cause disease in grapevines. Parameters such as solar radiation, leaf wetness, soil moisture and temperature, and wind speed and direction (especially for downy mildew) should also be considered to build models that are more realistic with respect to the conditions under which the diseases first appear on the crops and then progress and affect them. Such environmental modeling would help distinguish more finely among conditions that indicate different diseases or no disease at all.
It is also crucial to monitor such environmental measurements over a long period and to keep a detailed record of whenever a treatment is applied to prevent disease. In this way, data of sufficient quantity and quality can be assembled to help build an unbiased ML model with good generalization ability that can predict the appearance of a disease under any environmental conditions.
Further research addressing some or all of these limitations is needed for the early prediction of grapevine diseases. Nonetheless, with the current set of IoT sensors and the TabPFN transformer, our suggested system showed very good performance, and we are eager to apply it under real conditions to perform early prediction of downy mildew and powdery mildew. This very good performance is largely due to the fact that, among the aforementioned environmental parameters, temperature and humidity are the most significant for predicting these two diseases.