1. Introduction
Crop diseases that affect grapevines, such as downy mildew and powdery mildew, are highly damaging when infection becomes severe. Both the quantity and the quality of the grapes are reduced, leading to yield and financial losses for farmers. Disease development is driven by various fungal or fungal-like pathogens and by pest infestation. Fungicides are widely used to control these pathogens, but their extensive use has led to environmental degradation and biodiversity loss and has raised health concerns for both farmers and consumers. The workflow proposed here for the early prediction of these diseases would therefore lead to increased crop yield, better grape and wine quality, and reduced costs for farmers.
Downy mildew is a severe disease caused by the fungus-like (oomycete) pathogen Plasmopara viticola. Its symptoms appear on both the leaves and the vines. Once established on a plant, Plasmopara viticola spreads rapidly through secondary infections, producing a characteristic white, downy sporulation on the underside of the leaves. On the upper surface, oil-spot lesions appear that can evolve into white growth on the underside and later turn necrotic [1]. If left uncontrolled, the disease leads to premature defoliation, weakens the vines, and causes significant reductions in grape yield [2].
Historically, downy mildew has devastated vineyards, particularly in Europe. For example, it caused a 70% reduction in French grape production in 1915 and significant losses in Germany and Italy during the early 20th century [
1,
3]. The disease not only reduces yields but also affects the organoleptic characteristics of wine, affecting its appearance, aroma, and taste.
Powdery mildew is caused by the pathogen Erysiphe necator and differs from downy mildew in that it requires lower moisture levels to infect grapevines. An obvious symptom of this disease is a white powdery cover on the leaves that can affect all parts of the vine, including the fruit. Because this white powder covers a major part of the leaves, it reduces photosynthesis and exposes berries to sunburn through defoliation [4,5].
The significant impact of powdery mildew is observed in juice quality: that is, the combination of sugar accumulation, the intensity of the red color, browning, and acidity [
6]. Powdery mildew appears to affect sugar accumulation in grapevines in a manner similar to other chronic stress factors, such as drought or defoliation [
7]. Failing to control powdery mildew may lead to decreased Brix levels (even lower than the levels accepted by processors) and also chronic reductions in grapevine growth.
Control measures are implemented to mitigate the symptoms of the disease, but the economic losses remain substantial. The disease reduces yields and also lowers fruit quality. For instance, a 50% disease incidence in vineyards in South India resulted in substantial losses [
8]. Pool et al. [
9] recorded a 40% reduction in the vine size and a 65% reduction in the yield of Rossetee [
10] because of powdery mildew.
A grapevine disease prediction model would fail if it did not take environmental conditions into consideration because they are crucial in enabling the pathogens that cause the diseases [
11,
12]. Specifically, downy mildew needs a moderate environment (about 15–23 °C) and high humidity (over 80%). Also, rainfall plays a critical role in downy mildew outbreaks by providing the necessary moisture for spore germination and dispersal [
4]. On the other hand, powdery mildew needs warmer conditions (about 17–28 °C) and is not as dependent on moisture levels for infection to occur. Temperatures above 40 °C stop its development, and the relative humidity should be above 45% [
10,
13,
14].
Thus, considering that specific environmental conditions, such as temperature and humidity levels, affect the rate of plant disease infection, this work proposes a pipeline driven by weather and environmental parameters to predict the presence of these pathogens in grapevines and to detect the diseases at an early stage. The importance of an early grapevine disease prediction system is further highlighted by the fact that disease symptoms are not clearly visible in the early stages of infection, when the plant must be carefully inspected by the farmer.
Quite a few studies have explored crop disease prediction, with a particular focus on grapevine diseases, utilizing environmental Internet-of-Things (IoT) data. Various machine learning (ML) approaches have been employed, including Hidden Markov Models (HMMs) [
15] or decision trees and neural networks [
16], while others followed a rule-based approach [
17]. Convolutional neural networks (CNNs) and deep learning techniques have also been applied in image-based analyses for grapevine disease recognition [
18,
19]. Additionally, a publicly available dataset [
20], comprising three features over five months, has been utilized in hybrid rule-based predictions for downy mildew and powdery mildew diseases. In contrast, our dataset spans over four years and includes annotations for both diseases, providing more extensive temporal coverage and detailed labeling. A recent study similar to ours [
21], by Zhao and Efremova, combined multispectral imagery and environmental parameters to predict grapevine diseases at the block level. Their approach incorporated numerous features, which may have increased the complexity of the prediction task. Consequently, the TabPFN transformer did not significantly outperform models like XGBoost or CatBoost in their study. In contrast, our work leverages a curated, multiyear dataset of IoT sensor measurements, focusing on critical environmental features that facilitate the development of downy mildew and powdery mildew. We integrate these features with labels indicating whether fungicide treatments were applied, aiming to enhance predictive accuracy while maintaining feature simplicity.
Our objective is to perform early assessments of grapevine disease risks using monitored IoT environmental data and the corresponding treatment labels. By employing the TabPFN transformer, which delivers outputs in under one second, our workflow supports real-time applications in precision viticulture. This approach allows for proactive interventions before the manifestation of visible symptoms, enhancing overall grapevine health, yields, and fruit quality while concurrently reducing fungicide applications, agricultural input costs, and environmental impacts.
From a machine learning perspective, the early prediction and detection of grapevine diseases involve training models to recognize environmental patterns associated with disease development. These patterns encompass specific conditions that facilitate pathogen proliferation and the subsequent infection of grapevines. Traditionally, diseases are identified only after visible symptoms manifest on the crop. However, our objective is to anticipate the onset of the disease before such symptoms appear. Using IoT sensors to monitor environmental parameters and feeding these data into machine learning classifiers, we can predict disease risks in a timely manner. This proactive approach allows the assessment of the risks of pathogen infestation and the implementation of treatments before outbreaks occur or as the disease begins to progress. The benefits of this method are evident in the overall health of plants, leading to significant improvements in both the quantity and quality of grapevine yields. In addition, early disease prediction enables the application of precise treatment amounts, thereby reducing costs for farmers. The contribution of our work can be summarized as follows:
We make our self-curated and disease-annotated IoT environmental data publicly available, spanning from 2020 to May 2024, facilitating further research in this domain.
We perform a comparative analysis of different tabular data classifiers.
We demonstrate the efficacy of the TabPFN transformer on IoT environmental tabular data, highlighting its advantages over other predictors.
We present a workflow capable of early disease prediction and operable under real-time conditions due to the rapid inference capabilities of the TabPFN transformer.
2. Materials and Methods
This study aims to develop an early prediction system for grapevine diseases based on environmental conditions, thereby enabling timely interventions to prevent disease progression. To achieve this, we collected environmental data, specifically temperature, humidity, and rainfall accumulation, using IoT sensors deployed in vineyard settings. These parameters were processed into a feature vector comprising seven elements: mean, minimum, and maximum values for both temperature and humidity, along with cumulative rainfall measurements.
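As an illustration of this step, the following sketch builds one seven-element feature vector per day; it assumes sub-daily raw readings with hypothetical column names, and if the station already exports daily aggregates, it reduces to selecting the corresponding columns.

```python
import pandas as pd

def daily_feature_vectors(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw sensor readings into one 7-element feature vector per day.

    Expects columns 'timestamp', 'temperature', 'humidity', 'rainfall'
    (illustrative names; adapt to the actual sensor export).
    """
    df = df.copy()
    df["date"] = pd.to_datetime(df["timestamp"]).dt.date
    daily = df.groupby("date").agg(
        temp_mean=("temperature", "mean"),
        temp_min=("temperature", "min"),
        temp_max=("temperature", "max"),
        hum_mean=("humidity", "mean"),
        hum_min=("humidity", "min"),
        hum_max=("humidity", "max"),
        rain_sum=("rainfall", "sum"),   # cumulative rainfall per day
    )
    return daily.reset_index()
```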
These feature vectors served as inputs for ML classifiers designed to assess the risk of disease outbreaks. The models were trained using binary labels indicating the presence or absence of disease, inferred from records of fungicide applications by farmers and agronomists as preventive or curative measures.
Given the challenges associated with real-world datasets, such as limited sample sizes and class imbalances, data augmentation and balancing techniques were employed to enhance the model’s robustness. We evaluated multiple classification algorithms to determine the most effective model for predicting disease risk.
The dataset encompassed environmental data collected from 2020 to 2024, which, along with the augmented data, were used for training and testing the ML models. Data from 2023 and 2024 were reserved for testing and validation purposes. This approach aligns with contemporary practices in precision viticulture, where integrating IoT sensor data with ML algorithms facilitates proactive disease management strategies.
The overall suggested workflow and the steps followed to perform grapevine disease prediction are shown in
Figure 1.
2.1. Data Description
2.1.1. Grapevine Field Description
This study was conducted in a 7-hectare experimental vineyard located in Tebano (Faenza, RA), Emilia-Romagna, Italy. The vineyard is situated on flat terrain characterized by a clayey loam soil. It is cultivated under integrated crop management, combining sustainable agronomic practices to optimize vine health and productivity. The vineyard hosts multiple grapevine cultivars, including Sangiovese, Trebbiano, Pignoletto, Albana, Lambrusco Salamino, Lambrusco Sorbara, and Ancellotta, all grafted onto Kober 5BB rootstocks. The vines were planted in 2018 and trained using the Guyot system, which is suited to low-to-moderate-vigor vineyards. The planting layout consists of 2.7 m between rows and 1 m between vines within the same row, and the vineyard is oriented southwest. Irrigation was managed through a drip irrigation system with a flow rate of 4 L/h and drippers spaced every 50 cm.
2.1.2. IoT Environmental Data Acquisition and Labeling
The environmental monitoring system employed in this study was designed to collect and structure a multi-year dataset spanning from 2020 to 2024. Data acquisition relied on IoT-enabled weather stations installed in the vineyards to ensure the continuous and consistent monitoring of key parameters, including air temperature, relative humidity, and rainwater accumulation. These variables were recorded daily, providing a comprehensive overview of vineyard microclimatic conditions over multiple growing seasons.
Data collection was carried out using an IFARMING weather station (
https://ifarming.srl/products/?lang=en accessed on 21 February 2025), which featured a thermo-hygrometer sensor housed in a passive solar radiation shield for accurate air temperature and humidity measurements. A tipping-bucket rain gauge was also used to measure rainfall accumulation. The system was powered by a 1 W solar panel, with a data transmitter ensuring seamless data transfer via an internal or external antenna, depending on signal quality.
Data collected by these weather stations were transmitted and stored on dedicated IoT platforms. The IFARMING platform (
https://ifarming.srl/platform/?lang=en accessed on 21 February 2025) was used for real-time visualization, historical analysis, and integration with decision support systems. All collected data, including meteorological, soil, and vineyard management records, were systematically processed and structured into standardized Excel files. The meteorological dataset was categorized based on attributes such as the vineyard structural reference, company name, field identification, crop variety, and measurement date. The environmental variables were further refined through statistical aggregation, including the average, minimum, maximum, sum, deviation, mode, and median, allowing for improved data interpretation and integration across multiple years.
Alongside environmental monitoring, vineyard treatments were meticulously documented to establish a correlation between disease management interventions and the preceding climatic conditions. For each application, the records included the date, type of product, dosage, applied volume, and target pathogen. Treatments were explicitly annotated in the meteorological dataset, with the environmental data from up to five days prior also included. These treatment annotations were used as the labels in our experiments.
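For illustration, a minimal sketch of how such labels could be derived from the treatment records, with hypothetical column names; each treatment day and up to five preceding days are marked as high risk.

```python
import pandas as pd

def label_from_treatments(daily: pd.DataFrame, treatment_dates,
                          days_prior: int = 5) -> pd.DataFrame:
    """Attach binary labels to the daily feature rows.

    A day is labeled 1 (high disease risk) if a fungicide treatment was applied
    on that day or within the following `days_prior` days, i.e., each treatment
    annotation is propagated back to up to five days of preceding data.
    """
    daily = daily.copy()
    daily["date"] = pd.to_datetime(daily["date"])
    daily["label"] = 0
    for t in pd.to_datetime(list(treatment_dates)):
        window = (daily["date"] >= t - pd.Timedelta(days=days_prior)) & (daily["date"] <= t)
        daily.loc[window, "label"] = 1
    return daily
```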
2.2. Data Augmentation and Preprocessing
One of the main reasons that ML models underperform is data scarcity, together with the nature of the real data used to train a model. Real data are often difficult to capture or record in quantities sufficient for model convergence, and they may under-represent parts of the task, favoring some cases over others. A way to increase the amount of data in a qualitative manner, addressing data scarcity and improving the model's generalization ability, is to generate synthetic data that mimic the properties of the original data; thus, we performed synthetic data generation using a Gaussian copula and additive Gaussian noise. A problem often encountered in tabular datasets is that of missing (NaN) values. However, since the provided data contained one measurement per day, along with the maximum and minimum of the respective feature, this was not an issue in our case, and no preprocessing for missing values took place.
2.2.1. Gaussian-Copula-Based Synthetic Data Generation
Inspired by studies that worked on the generation of synthetic data for an ML emulator using environmental data [
22] or those that suggested a copula-based framework for the generation of synthetic data [
23], we leveraged the efficiency of the simple statistical methods (multivariate CDF) of the Gaussian copula [
24] to generate new synthetic data, since the amount of initial data was too small for the model to be trained on it properly. A Gaussian copula models the marginal distributions of the features and captures the correlations among them by estimating their joint cumulative distribution function (joint CDF). It is constructed from a multivariate normal distribution with correlation matrix P, and the inverse of the copula's construction steps is used to generate pseudo-random samples from general classes of multivariate probability distributions. In our case (tabular data of environmental measurements), copulas and their inverse transformation methods are very useful for producing a synthetic dataset that mimics the real data in terms of the associations among its features [
25].
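The following is a minimal sketch of this construction using empirical marginals; an off-the-shelf copula implementation could be used instead, and the function shown here is purely illustrative.

```python
import numpy as np
from scipy import stats

def gaussian_copula_sample(X: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Draw synthetic rows that mimic the feature associations of X.

    Standard Gaussian-copula steps:
      1. map each feature to uniform ranks, then to standard-normal scores;
      2. estimate the correlation matrix of these scores;
      3. sample from the fitted multivariate normal;
      4. map back through the empirical marginal quantiles.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. probability-integral transform via empirical ranks
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))
    # 2.-3. fit the correlation matrix and sample latent Gaussian vectors
    corr = np.corrcoef(z, rowvar=False)
    latent = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    # 4. back-transform through each feature's empirical quantile function
    u_new = stats.norm.cdf(latent)
    return np.column_stack([np.quantile(X[:, j], u_new[:, j]) for j in range(d)])
```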
2.2.2. Additive-Gaussian-Noise-Based Synthetic Data Generation
In addition to augmenting the data with a Gaussian copula, we also applied additive Gaussian noise. The generation of synthetic data using additive Gaussian noise has already been explored for environmental data [26]. For each feature column, the standard deviation of the noise was set to 0.1 times the standard deviation of the original feature, as this value produces new samples that are close to the original ones while retaining a sufficient degree of variance.
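A minimal sketch of this augmentation step (function and variable names are illustrative):

```python
import numpy as np

def augment_with_gaussian_noise(X: np.ndarray, y: np.ndarray,
                                scale: float = 0.1, seed: int = 0):
    """Create one noisy copy of each training row.

    For every feature column, the noise standard deviation is `scale` (0.1 here)
    times that column's standard deviation, so synthetic samples stay close to
    the originals while adding some variance. Labels are copied unchanged.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, scale * X.std(axis=0), size=X.shape)
    return np.vstack([X, X + noise]), np.concatenate([y, y])
```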
2.2.3. Data Normalization (Standardization)
We performed normalization, specifically standardization, a common method in which each feature is transformed so that it has a mean of zero and a standard deviation of one. The mean and standard deviation were computed from the training data. The standardization process is described by the following formula:

$$z = \frac{x - \mu}{\sigma},$$

where $z$ is the standardized value of $x$ (the value of the feature to be normalized), $\mu$ is the mean of the training data, and $\sigma$ is the standard deviation of the training data.
When features vary widely in scale (for example, a tabular dataset containing age in years alongside income in USD), algorithms that use distance metrics (such as k-nearest neighbors or support vector machines) or gradient-based optimization (like neural networks and linear regression with gradient descent, as in our case) can be biased toward features with larger magnitudes. By normalizing the input features, we bring them onto a comparable scale, ensuring that each feature influences the outcome to a similar degree, so normalization is an essential step before feeding the data to a model.
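A minimal sketch of this step with scikit-learn, assuming X_train and X_test hold the training and test feature matrices; the scaler is fit on the training split only.

```python
from sklearn.preprocessing import StandardScaler

# Fit on the training split only, then apply the same mean/standard
# deviation to the test split (z = (x - mu) / sigma).
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
```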
2.2.4. Data Balancing
In real-world applications, imbalanced datasets are common, where one class significantly outnumbers the other(s). This imbalance is also evident in our dataset, which is expected, since pathogens typically proliferate when environmental conditions are favorable or when no treatment is applied. The SMOTE [
27] algorithm was chosen for oversampling as a method to form a more balanced dataset.
We first split the data into 80% for training and 20% for testing and then performed normalization and oversampling using only the training data. Before oversampling, we had 1300 training data points, of which 988 belonged to class 0 (no disease) and 312 to class 1 (disease). After oversampling, the ratio of the minority class to the majority class was 80%, raising the amount of training data to 1778, of which 790 belonged to class 1, while the number of class 0 samples remained the same.
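A sketch of the balancing step with imbalanced-learn, assuming the standardized training split from the previous step; the random seed is an illustrative choice.

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority (disease) class on the training split only, up to
# 80% of the majority class size: the 988 "no disease" rows stay as they are,
# and the minority class is raised to roughly 790 real plus synthetic rows.
smote = SMOTE(sampling_strategy=0.8, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_std, y_train)
```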
2.3. Grapevine Disease Prediction Using ML
We formulated the pathogen prediction task as a binary classification problem on tabular data, aiming to determine whether there is a high risk of powdery mildew or downy mildew development. To evaluate model performance, we employed various classifiers, including logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), random forest, gradient boosting, XGBoost, CatBoost, and the recently introduced transformer-based TabPFN classifier. The following sections briefly outline how each model was configured and trained. Given that the employed machine learning algorithms, except for TabPFN, are well established, we focus our methodological exposition on TabPFN, a state-of-the-art model for tabular data prediction.
2.3.1. Logistic Regression
Logistic regression was used as a simple and fast baseline ML method. It is well suited for our problem as a statistical method that solves binary problems. As a solver for estimating the parameters of logistic regression, the LBFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) algorithm was used [
28], which is an efficient optimization algorithm that uses limited memory resources.
2.3.2. KNN
The k-nearest neighbors algorithm is one of the simplest ML classifiers. The training examples of each class are stored, each new test example is compared to them in terms of distance, and the class assigned to it is that of the majority among its k closest neighbors. We set the number of neighbors to 3 and used the Euclidean distance.
2.3.3. Support Vector Machine (SVM)
Since we have a binary classification task, SVM was also chosen as a supervised algorithm that maximizes the margin between the closest points of the two classes. The kernel of the classifier was the radial basis function (RBF), and its gamma parameter was set to the inverse of the product of the number of features and the variance of the data.
2.3.4. Random Forest
A way to obtain stronger predictive results and achieve robustness is to use ensemble methods like random forest. The creation of a multitude of decision trees during training often leads to stronger classification results. Experimentally, we found that 10 is the optimum maximum tree depth for our data.
2.3.5. Gradient Boosting
Gradient boosting, also an ensemble method, combines several weak learners into a strong learner: each new model is trained to reduce the errors of the previous ensemble by minimizing a loss function, such as the mean squared error or cross-entropy, via gradient descent. We set the number of estimators (that is, the number of trees in the ensemble) to 100 and the maximum depth of each tree to 8.
2.3.6. XGBoost
XGBoost [
29] is a computationally optimized algorithm that is powerful for structured or tabular data like ours. XGBoost is an extension of gradient boosting, and it was selected here for its very good performance in a wide range of classification tasks. We set its maximum tree depth to 5.
2.3.7. CatBoost
CatBoost [
30], as an ensemble of decision trees, was chosen for several reasons. By using ordered target statistics and ordered boosting, it avoids target leakage and is therefore less prone to overfitting. On tabular data, it performs similarly to or even better than XGBoost, and since it is a tree-based method, it can provide some level of explainability; this is useful in real applications where a farmer may ask for clarification about the model's output.
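For reference, a sketch that gathers the classifier configurations stated in Sections 2.3.1 to 2.3.7; parameters not mentioned in the text (such as random seeds or probability estimation for the SVM) are illustrative choices, and the balanced training split from Section 2.2.4 is assumed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Hyperparameters as described in the text; everything else is left at its
# default value or set only for reproducibility.
models = {
    "logistic_regression": LogisticRegression(solver="lbfgs", max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "svm": SVC(kernel="rbf", gamma="scale", probability=True),
    "random_forest": RandomForestClassifier(max_depth=10, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, max_depth=8),
    "xgboost": XGBClassifier(max_depth=5, eval_metric="logloss"),
    "catboost": CatBoostClassifier(verbose=0),
}

for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
```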
2.3.8. TabPFN Transformer
The TabPFN transformer [31] is a recently proposed tabular classification algorithm based on prior-data fitted networks [32], which showed that a transformer [33] can approximate Bayesian inference. TabPFN brings a radically new view of how tabular classification is carried out. It provides fast and accurate results because it is trained with a meta-learning approach on synthetic datasets generated from a carefully chosen prior over tabular data distributions, which is why the architecture performs well across diverse tabular prediction tasks with minimal or even no hyperparameter tuning.
The TabPFN transformer leverages, in a slightly different way, the core module of the transformer architecture [
33]. The self-attention mechanism computes attention weights that capture the relationships among input features, enabling the model to contextualize each element based on its relevance to others. Three different linear projections are computed from the given input $X$: a set of queries $Q = XW_Q$, keys $K = XW_K$, and values $V = XW_V$. Based on these, the self-attention mechanism is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimensionality of the key vectors $K$. The authors of [
31] utilize the standard self-attention mechanism; however, their approach differs in how attention computations are handled between the training and testing phases. They make use of two modules: one that computes self-attention among training data and another that computes cross-attention from testing examples to training ones.
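To make the attention computation concrete, below is a small NumPy sketch of the scaled dot-product attention defined above, applied once among (projected) training rows and once from test rows to training rows; Q_train, K_train, V_train, and Q_test stand for already-projected representations and are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Self-attention among training rows: queries, keys, and values all come
# from the projected training set.
train_ctx = scaled_dot_product_attention(Q_train, K_train, V_train)
# Cross-attention from test rows to training rows: queries come from the
# test set, keys and values from the training set.
test_ctx = scaled_dot_product_attention(Q_test, K_train, V_train)
```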
The TabPFN transformer distinguishes itself through its unique inference mechanism. During inference, the model receives both the training dataset and the test samples as inputs but does not undergo additional training on the provided data. Instead, it leverages the combined information to approximate the posterior predictive distribution (PPD), denoted $p(y_{\mathrm{test}} \mid x_{\mathrm{test}}, D_{\mathrm{train}})$, in a single forward pass. This approach enables the model to predict the output of the test samples by effectively contextualizing them within the training data. TabPFN is (pre-)trained on more than one million synthetic datasets generated from a carefully designed prior distribution that simulates realistic tabular data patterns through causal reasoning principles. This training enables the model to approximate Bayesian inference for new datasets and to perform real-time predictions while keeping the weights learned during the prior-fitting phase fixed.
Moreover, the TabPFN transformer omits positional encoding, which becomes redundant when the input is not sequential, and instead applies zero-padding and linear scaling to the features so that the model can accept inputs of variable length. In addition, the input X is preprocessed to handle missing values and heterogeneous feature types, which influences the subsequent projections of Q, K, and V.
Apart from its very good performance, the TabPFN transformer also has the advantage of yielding predictions in less than a second; this is important in terms of sustainability since it allows for affordable and green ML and also in terms of time complexity, making it a good choice for a real-time system.
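A minimal usage sketch with the tabpfn Python package follows; constructor arguments may differ across package versions, and the variable names reuse the earlier sketches.

```python
from tabpfn import TabPFNClassifier

# TabPFN is not retrained on our data: fit() only stores the training context,
# and a single forward pass at prediction time approximates the PPD.
clf = TabPFNClassifier(device="cpu")
clf.fit(X_train_bal, y_train_bal)
risk_proba = clf.predict_proba(X_test_std)[:, 1]   # P(high disease risk)
y_pred = (risk_proba >= 0.5).astype(int)
```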
2.4. Evaluation Metrics
Accuracy, which gives the proportion of total predictions that were correct, is the typical metric in classification tasks. In addition to it, we also use the ROC-AUC score, precision, recall, and the F1 score as evaluation metrics in order to gain a better view of the model's behavior and performance.
The ROC-AUC score computes the area under the curve (AUC) of the receiver operating characteristic (ROC). The ROC curve plots sensitivity (true positive rate) against the false positive rate (1 − specificity). The AUC measures the entire area under the ROC curve and is less influenced by changes in class distribution than accuracy. We used its weighted version, which computes the metric for each label and averages the results, weighted by the number of true instances of each label.
Precision and recall are useful since their high values indicate a few false positives and a few false negatives, respectively. However, both the number of false positives and false negatives should be considered, and the F1 score integrates them as a harmonic mean of precision and recall.
We measure precision, recall, and F1 score evaluation metrics for both classes separately to see how the model behaves in each case; we are more interested in the minority class: that is, the case with a high risk of a grapevine disease actually existing.
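A sketch of how these metrics can be computed with scikit-learn, reusing the predictions and probabilities from the previous sketch:

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_recall_fscore_support)

# Accuracy and weighted ROC-AUC, plus per-class precision/recall/F1 so that
# the minority "Yes" (disease risk) class can be inspected separately.
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, risk_proba, average="weighted")
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, labels=[0, 1])
print(f"accuracy={acc:.3f}  ROC-AUC={auc:.3f}")
print(f'class "No":  precision={prec[0]:.3f} recall={rec[0]:.3f} F1={f1[0]:.3f}')
print(f'class "Yes": precision={prec[1]:.3f} recall={rec[1]:.3f} F1={f1[1]:.3f}')
```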
3. Results
As mentioned above, we experimented with different ML prediction models on our collected and curated IoT tabular data. Some of the models were chosen as baseline methods, like logistic regression and KNN, which are traditionally used for tabular data prediction, and others were chosen for their proven very good performance, such as XGBoost [
29], CatBoost [
30], and also the state-of-the-art TabPFN transformer [
31]. The comparison results of the different ML models for predicting the risk of the infection of downy mildew and powdery mildew are shown in
Table 1 and
Table 2, respectively. The test set consisted of 326 seven-element feature vectors of the three environmental parameters (the minimum, maximum, and mean values of temperature and humidity, plus the daily rain accumulation): that is, measurements and fungicide treatment records corresponding to 326 days.
Considering the results, logistic regression, SVM, and KNN performed rather poorly, although they remained useful as baselines. The problem also proved difficult for the ensemble and boosting methods (random forest, gradient boosting, XGBoost, and CatBoost): their accuracy appeared good, but a closer look at the "Yes" subcolumn (disease present) of the precision, recall, and F1 score shows that this accuracy stemmed mainly from their learned ability to recognize the absence of disease rather than its presence, which is what we actually want to detect. The task likely became harder for these classifiers because additional high-risk days preceding the actual fungicide treatments were labeled as positive. Our expectations were met by the TabPFN transformer, which was the only classifier with high precision and recall for the "Yes" class, the class indicating a high risk of disease presence or development.
4. Discussion
Our aim in this work was to propose a robust, real-time system for predicting downy mildew and powdery mildew and to determine which ML classifier performs best for this task. Taking into account the results of
Table 1 and
Table 2, the TabPFN transformer proved that it can be used as a predictor in a real-time system.
Our IoT sensors monitored three environmental features: temperature, relative humidity, and rainwater accumulation. However, additional environmental parameters play an important role in enabling pathogens to cause disease in grapevines. Parameters such as solar radiation, leaf wetness, soil moisture and temperature, and wind speed and direction (especially for downy mildew) should also be considered to build models that are more realistic with respect to the conditions under which the diseases first appear on the crops and then progress and affect them. Such environmental modeling would help distinguish more finely among conditions that indicate different diseases or no disease at all.
It is also crucial to monitor such environmental measurements over a long period and to keep a detailed record of whenever a treatment is applied to prevent disease. In this way, data of sufficient quantity and quality can be assembled to help build an unbiased ML model with good generalization ability that can predict the appearance of a disease under any environmental conditions.
Further research addressing some or all of these limitations is needed for the early prediction of grapevine diseases. Nonetheless, with the current set of IoT sensors and the TabPFN transformer, our suggested system showed very good performance, and we are eager to apply it under real conditions to perform early prediction of downy mildew and powdery mildew. This very good performance is largely due to the fact that, among the aforementioned environmental parameters, temperature and humidity are the most significant for predicting these two diseases.