4.1. Dataset Description and Application Context
The dataset employed in this study comprises high-resolution meteorological and disease outbreak records collected across multiple agroecological zones in China, including Yunnan, Inner Mongolia, Hubei, Guizhou, Gansu, and Chongqing. These regions span a broad climatic gradient, from temperate plateaus to humid subtropical basins, ensuring the inclusion of diverse weather patterns relevant to potato cultivation and the development of late blight. A total of 54 meteorological monitoring stations were distributed across 24 cities in 6 provinces. The geographical distribution of these monitoring sites is illustrated in
Figure 5.
The meteorological data were obtained via authorized access to regional agricultural meteorological monitoring systems maintained by provincial agricultural bureaus. Specifically, QD-3340MV wireless automatic weather stations, manufactured by Beijing Huisi Junda Technology Co., Ltd. (Beijing, China), were deployed at field level in high-risk zones for real-time, multi-variable sensing. These stations are equipped to capture hourly data on temperature (current, maximum, minimum), relative humidity, dew point, pressure, precipitation, wind speed and direction, and sunshine duration. A representative in-field deployment is shown in
Figure 6, which shows a solar-powered in-field station alongside local field-management activities.
Disease outbreak records were collected from provincial plant protection stations, where trained agronomists conducted field surveys to record late blight occurrence, severity index (0–9), affected acreage, cultivar type, and crop growth stage. These observations were timestamped and manually aligned with the corresponding hourly meteorological data to ensure temporal consistency.
The final dataset covers a continuous period from January 2016 to December 2022, spanning seven growing seasons and including multiple regional epidemic cycles. After preprocessing (deduplication, missing value imputation, and outlier filtering using a 1.5 × IQR rule), the dataset comprised 32,432 hourly records. Approximately 18.7% of these records are labeled as late blight positive (binary label: 1 for outbreak, 0 otherwise). A temporal split was used for model development: 70% of the data (2016–2020) for training, 20% (2021) for validation, and 10% (2022) for testing. This setup enables the model to learn from long-term seasonal patterns while assessing generalization on unseen temporal conditions.
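The outlier filtering and temporal split described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' pipeline: the record layout, the quartile estimator, and the assumption that the validation year sits between the stated training (2016–2020) and test (2022) spans are ours.

```python
def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # simple linear-interpolation quartile estimate
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_outliers(records, key):
    """Drop records whose value for `key` falls outside the IQR fences."""
    lo, hi = iqr_bounds([r[key] for r in records])
    return [r for r in records if lo <= r[key] <= hi]

def temporal_split(records):
    """2016-2020 -> train, 2021 -> validation, 2022 -> test (assumed)."""
    train = [r for r in records if r["year"] <= 2020]
    val = [r for r in records if r["year"] == 2021]
    test = [r for r in records if r["year"] == 2022]
    return train, val, test
```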
While the current study focuses on retrospective prediction, PLB-GPT is designed for future integration into real-time decision-support platforms. As illustrated in
Figure 6, field-deployed sensing stations already serve as data collection terminals. With minimal infrastructure upgrade, these stations can serve as front-end nodes for mobile- or cloud-based applications powered by PLB-GPT. Such systems would provide farmers with actionable alerts through smartphones or web interfaces, enabling timely intervention. Although deployment on UAVs or agricultural robots is not explored in this work, our modular architecture allows flexible adaptation for such platforms in future extensions.
Table 2 provides representative samples of the hourly meteorological monitoring records used in this study. The examples highlight both ordinary weather conditions without disease occurrence (e.g., 11 June 2020) and an outbreak case (15 July 2020), where high humidity (91%), elevated temperature (24 °C), and sustained precipitation (8.4 mm) coincided with a late blight event. These records illustrate the diversity of variables captured, including temperature, humidity, dew point, pressure, precipitation, and wind characteristics, and their direct association with binary disease labels. The table is not exhaustive but demonstrates the data structure and the type of meteorological–disease correspondences on which PLB-GPT is trained. A preliminary inspection of the dataset suggests that variables such as relative humidity, dew point, and precipitation exhibit the strongest association with late blight outbreaks, whereas temperature and wind conditions interact with these factors to modulate disease risk. A more systematic correlation analysis will be pursued in our future work.
4.3. Results Evaluation
To evaluate the performance of PLB-GPT, the following classification metrics were used: accuracy, precision, recall, and F1-score. Accuracy measures the ratio of correctly predicted instances to the total number of instances. Precision evaluates the proportion of true positive predictions among all predicted positives. Recall reflects the proportion of actual positive cases correctly identified by the model. The F1-score, as the harmonic mean of precision and recall, provides a balanced metric when both false positives and false negatives are costly. These metrics offer a comprehensive view of the model’s predictive performance under various error trade-offs.
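For concreteness, the four metrics can be computed from binary labels (1 for outbreak, 0 otherwise) as in the sketch below; function and variable names are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```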
It is important to note that the choice of forecast horizon directly shapes the nature and feasibility of agricultural interventions and risk management strategies. Short-term forecasts (e.g., 1-day or 3-day ahead) allow for highly responsive actions with minimal uncertainty, but the limited lead time often restricts farmers to emergency measures—such as spot spraying—that are less efficient for large-scale operations. In contrast, longer horizons (e.g., 7-day ahead) enable proactive planning, including fungicide procurement, labor scheduling, and field preparation, which are essential for cost-effective, preventive disease control.
However, the extended lead time also increases exposure to forecasting uncertainty, potentially resulting in either unnecessary applications or delayed responses. A medium-range horizon (around 5-day) typically offers a practical compromise: it provides sufficient time to organize targeted, preventive fungicide applications while remaining close enough to the event to support reliable decision-making. This balance aligns well with the biological dynamics of potato late blight and supports integrated disease management that optimizes both crop protection and resource use.
4.4. Data Preprocessing
To ensure consistency in the scale and distribution of the input data, instance normalization was applied to all features. This normalization standardizes each feature by adjusting its mean to zero and its standard deviation to one, improving the model's stability and accuracy during training by preparing the data in a form that facilitates convergence and generalization.
Figure 7 presents the distribution of key meteorological features after normalization.
The primary meteorological features of this dataset include temperature, humidity, maximum temperature, minimum temperature, pressure, and dew point. After normalization, each feature has a mean of zero and a standard deviation of one, ensuring comparability across different features. This standardization enhances the model’s ability to generalize and improves performance in subsequent analyses.
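A minimal sketch of the per-feature standardization described above (pure Python for clarity; the actual pipeline presumably operates on the full feature matrix):

```python
import math

def standardize(column):
    """Shift a feature column to zero mean and unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    std = math.sqrt(var) or 1.0  # guard against constant features
    return [(x - mean) / std for x in column]
```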
The length of the time window is crucial for mining temporal patterns in time-series prediction. Shorter windows, such as 1-day or 3-day windows, are suitable for tasks requiring rapid responses to changes in weather, such as short-term weather forecasts. In contrast, longer windows, such as 5-day or 7-day windows, are better suited for capturing long-term trends and seasonal variations, providing insights into broader climate influences on crop growth and disease development. Therefore, to capture the temporal dynamics effectively, we tested the meteorological features under different time windows.
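The windowing itself can be sketched as below. The 24-steps-per-day constant follows from the hourly sampling stated earlier, while the stride choice is an illustrative assumption rather than the authors' setting.

```python
HOURS_PER_DAY = 24  # hourly records, per the dataset description

def make_windows(series, days, stride=HOURS_PER_DAY):
    """Slice a 1-D hourly series into fixed-length windows of `days` days."""
    width = days * HOURS_PER_DAY
    return [series[i:i + width]
            for i in range(0, len(series) - width + 1, stride)]
```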
Figure 8 shows the periodic variations in meteorological features with different time window lengths.
1-Day Time Window:
Figure 8a demonstrates the variation of meteorological features within a single day. A clear correlation is observed between temperature and dew point, as an increase in temperature typically accompanies a rise in dew point, indicating higher air moisture content. Humidity fluctuates significantly, especially toward the end of August and early September, where rapid increases are seen. The pressure drops consistently during this period, likely due to rainfall or changes in the weather system.
3-Day Time Window: In
Figure 8b, a 3-day window is used to capture medium-term patterns. The smoothing effect of this window makes long-term trends more evident, reducing daily noise. Temperature and dew point continue to exhibit similar trends, while humidity shows pronounced fluctuations during late August and early September, reflecting a significant weather event, such as heavy rainfall. Pressure follows a downward trend, aligning with increased humidity, suggesting the arrival of a low-pressure system.
5-Day Time Window:
Figure 8c displays data trends over a 5-day window, revealing mid- to long-term climate patterns. The close relationship between temperature and dew point suggests that temperature shifts significantly influence air moisture content. During mid- to late August, both temperature and dew point follow nearly identical trajectories, highlighting their interdependence. Humidity rises sharply toward the end of August, correlating with a drop in pressure, indicating a seasonal change or weather event.
7-Day Time Window: In
Figure 8d, the 7-day window illustrates long-term climate variation. Humidity spikes significantly at the end of August, likely due to the influence of a large-scale weather system such as a typhoon or low-pressure area, which typically brings substantial rainfall and moisture. The corresponding pressure drop supports this hypothesis, indicating the presence of a low-pressure system. The consistent relationship between temperature and dew point further emphasizes the role of temperature in controlling atmospheric moisture levels.
Beyond the qualitative trends observed across different time windows, the feature interactions can be further interpreted from a quantitative correlation perspective. Consistent with the visual analysis, temperature and dew point exhibit a strong positive correlation across all window lengths, confirming their tight coupling in characterizing atmospheric moisture conditions. Humidity shows a moderate positive correlation with precipitation-related patterns, particularly in the 5-day and 7-day windows, reflecting sustained moisture accumulation that is favorable for late blight development. In contrast, atmospheric pressure demonstrates a weaker or negative correlation with moisture-related variables, providing complementary information rather than redundancy. These correlation patterns explain why longer aggregation windows amplify disease-relevant signals and motivate the use of channel-independent processing to preserve informative interactions while mitigating correlated noise.
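The correlation view described here can be reproduced with a plain Pearson coefficient over aligned feature series (e.g., temperature vs. dew point within a window). The sketch below assumes equal-length lists and is not the authors' analysis code.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two aligned series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```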
4.4.1. Comparison of Fine-Tuning Methods
To evaluate the effectiveness of the LP-FT (Linear Probing followed by Full Fine-Tuning) two-stage fine-tuning strategy, a series of experiments was conducted to compare it with the traditional single-stage fine-tuning approach. The primary goal was to understand differences in model adaptation, convergence speed, stability, and generalization ability, as reflected in the training and validation loss.
4.4.2. Linear Probing
In the Potato Late Blight prediction task, linear probing involved fine-tuning only the output layer while keeping all other layers frozen. This approach takes advantage of the pretrained GPT-2 model’s internal representations without altering the majority of the network, allowing for rapid adaptation to the new task. By freezing the layers, the model focuses on tuning only the output weights to match the specifics of the new dataset.
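Conceptually, linear probing reduces to training only a linear head on top of frozen features. The toy sketch below uses a fixed stand-in feature map in place of the frozen GPT-2 backbone; it illustrates the idea only, and all names are ours.

```python
import math

def frozen_backbone(x):
    # fixed (never-updated) feature map standing in for the frozen
    # pretrained layers; the trailing 1.0 acts as a bias feature
    return [x[0] + x[1], x[0] - x[1], 1.0]

def train_linear_head(samples, lr=0.1, epochs=200):
    """Logistic-regression head; only these weights are ever updated."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in samples:
            h = frozen_backbone(x)
            logit = sum(wi * hi for wi, hi in zip(w, h))
            p = 1.0 / (1.0 + math.exp(-logit))
            grad = p - y  # derivative of the logistic loss w.r.t. the logit
            w = [wi - lr * grad * hi for wi, hi in zip(w, h)]
    return w
```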
Table 3 outlines the key training parameters for linear probing.
Figure 9 illustrates the changes in training and validation loss during linear probing. The training loss, represented by the solid black line, shows a sharp decline in the first few iterations, reflecting the model’s quick adaptation to the task through adjustments to the output layer. This rapid initial drop is expected, as linear probing only updates a small portion of the model.
In contrast, the validation loss, represented by the gray dashed line, shows a similar rapid decline but also exhibits fluctuations during the mid-stage (iterations 5–7), indicating potential signs of overfitting. This is likely due to the fact that only the output layer is being fine-tuned, leaving the frozen layers unchanged, which limits the model’s ability to generalize effectively to unseen data.
Although linear probing demonstrates rapid convergence, the mid-stage fluctuations in validation loss suggest that this approach may benefit from additional regularization or moving to a more comprehensive fine-tuning strategy to avoid overfitting to the training data. This limitation becomes more pronounced when the task involves complex data interactions, such as those present in the prediction task, where dynamic weather conditions significantly influence disease outbreaks.
4.4.3. Full Fine-Tuning
In the full fine-tuning stage, all layers of the pretrained model are unfrozen, allowing for a more comprehensive optimization of the model’s internal representations. This strategy enables the model to better adapt its internal weights to the specific characteristics of the Potato Late Blight prediction task, which involves intricate relationships between meteorological factors and disease propagation.
Table 4 outlines the key training parameters used during this stage.
Figure 10 shows the training and validation loss during full fine-tuning. The training loss exhibits a sharp decline during the initial iterations (1–5), similar to linear probing, but continues to decrease more gradually as the model fine-tunes its internal parameters throughout the network. This gradual reduction reflects the model’s ability to better capture complex patterns in the data, particularly those related to the dynamic environmental conditions that drive potato late blight outbreaks.
The validation loss also decreases significantly during the initial iterations, following a similar trend to the training loss. While the validation loss stabilizes during the later stages of training, it remains slightly higher than the training loss, which is indicative of good generalization performance. However, occasional fluctuations in the mid-stage (iterations 10–20) suggest that the model is still learning to balance task-specific features with the general pretrained knowledge.
Compared to linear probing, full fine-tuning shows more gradual and sustained improvements in both training and validation loss, suggesting that it is better suited for tasks requiring deep integration of new task-specific information across all layers of the model.
4.4.4. Two-Stage Fine-Tuning
The two-stage fine-tuning strategy (LP-FT) combines the efficiency of linear probing with the comprehensive adaptation of full fine-tuning. This strategy begins with linear probing, allowing the model to quickly adapt its output layer, followed by full fine-tuning, which fine-tunes all layers to better align with the target task. The two-stage approach aims to strike a balance between fast convergence and effective generalization by progressively introducing complexity to the fine-tuning process.
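The two-stage schedule can be expressed as a simple iteration plan. The phase lengths below (5 linear-probing iterations followed by 25 full fine-tuning iterations) mirror the iteration ranges reported for Figure 11, but they are configurable assumptions, not fixed properties of the method.

```python
def lpft_schedule(lp_iters=5, ft_iters=25):
    """Yield (iteration, trainable_scope) pairs for the LP-FT protocol."""
    for i in range(1, lp_iters + 1):
        yield i, "output_head_only"   # stage 1: linear probing
    for i in range(lp_iters + 1, lp_iters + ft_iters + 1):
        yield i, "all_layers"         # stage 2: full fine-tuning
```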
Table 5 summarizes the parameters for the LP-FT approach.
Figure 11 shows the training and validation loss curves during the two-stage fine-tuning process. In the linear probing stage (iterations 1–5), the training loss rapidly decreases as only the output layer is updated. As the model transitions into the full fine-tuning stage (iterations 6–30), the training loss continues to decrease at a slower rate as the model refines its internal representations. The validation loss follows a similar trend, with slight fluctuations during the transition between the two stages, but ultimately stabilizes, indicating strong generalization performance.
The two-stage fine-tuning approach demonstrates a balance between rapid initial adaptation and deep task-specific refinement. By leveraging both the efficiency of linear probing and the flexibility of full fine-tuning, the LP-FT method results in more stable training and better generalization performance compared to either method in isolation.
4.4.5. Comparison of Fine-Tuning Strategies
Figure 12 provides a comparison of the training and validation loss curves across the three fine-tuning strategies. The two-stage fine-tuning method (LP-FT) achieves the lowest validation loss overall, indicating superior generalization capability. Additionally, the training loss for LP-FT is consistently lower than that for full fine-tuning and linear probing, suggesting that the two-stage process allows the model to converge more effectively by gradually introducing task-specific adjustments.
Overall, the LP-FT approach outperforms both linear probing and full fine-tuning, achieving a more optimal balance between model stability and generalization performance. The combination of rapid initial adaptation and thorough parameter tuning across all layers makes LP-FT an efficient and effective strategy for fine-tuning large pretrained models like GPT-2 for domain-specific tasks.
4.5. Comparison with Baseline Models
For late-blight forecasting, we benchmarked PLB-GPT against four strong baselines: (1) CARAH, (2) ARIMA, (3) LSTM, and (4) Informer, on four forecast horizons (H = 1-day, 3-day, 5-day, 7-day). Performance was scored with the standard quartet of classification metrics: Accuracy, Precision, Recall, and F1. This cross-horizon evaluation yields a fine-grained picture of each model's short-term and medium-term predictive capacity; the results are presented in Table 6.
4.5.1. Performance and Practical Relevance
Table 6 documents a uniform lead for PLB-GPT over the four baselines at every forecast horizon and on every metric (Accuracy, Precision, Recall, F1). Averaged over horizons, PLB-GPT improves Accuracy by … pp on Informer, … pp on LSTM, and … pp on ARIMA. Its Precision margin peaks at the 5-day horizon (0.8915 vs. 0.8025 for the next best deep-learning model, i.e., Informer), indicating substantially fewer false alarms, which is critical when fungicide applications are costly.
Importantly, PLB-GPT maintains high Recall (…–…) across horizons, demonstrating robustness to the severe class imbalance inherent in disease outbreak data. The F1-score remains above 0.7732 even at the 1-day horizon and reaches 0.8472 at 5-day. This balanced precision–recall profile ensures both low missed-detection risk and minimal unnecessary interventions, key requirements for real-world agricultural decision support.
4.5.2. Robustness, Statistical Significance, and Stability
Late blight outbreaks represent a minority class in the dataset, making imbalance-aware interpretation essential. Under such conditions, Accuracy alone may obscure model behavior, whereas Recall and F1-score provide more informative signals for outbreak detection. Across all forecasting horizons, PLB-GPT maintains consistently higher Recall for the positive class, indicating a reduced risk of missed outbreak events. At the same time, its strong Precision suggests that this gain is not achieved at the cost of excessive false alarms. This balance is particularly important in agricultural practice, where false negatives can lead to severe yield loss, while unnecessary interventions increase economic and environmental costs.
All baselines, especially ARIMA, show monotonic degradation as H grows, reflecting a compounding error in autoregressive roll-outs. In contrast, PLB-GPT holds its scores steady (and even improves from 1-day to 5-day), suggesting that its weather-aware latent representation curbs horizon-specific drift.
We ran paired t-tests on each metric across the four horizons (Table 7). Applying a Bonferroni correction for four comparisons (…), PLB-GPT significantly outperforms ARIMA, LSTM, and Informer on all metrics (…). The mean F1 gain over CARAH (… pp) is practically relevant, though not significant under the corrected threshold (…), echoing CARAH's strong short-horizon specialization.
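The testing procedure can be sketched as follows. The numbers in this illustration are synthetic, not values from Table 7, and a full p-value would additionally require the t-distribution CDF (e.g., via a statistics library).

```python
import math

def paired_t_statistic(a, b):
    """Paired t statistic over per-horizon metric differences."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

def bonferroni_alpha(alpha, n_comparisons):
    """Bonferroni-adjusted significance threshold."""
    return alpha / n_comparisons
```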
Repeating the evaluation with three random data splits and five parameter seeds (not shown for brevity) yields an F1 standard deviation below 0.7 pp for PLB-GPT, versus 1.8–3.2 pp for LSTM and Informer, underscoring better robustness to data and initialization noise.
By integrating Reversible Instance Normalization, a transformer-style temporal encoder, and a two-stage LP → FT fine-tuning protocol, PLB-GPT establishes a new state of the art for short-range and medium-range late-blight forecasting, balancing high precision–recall performance with robustness to horizon length, data split, and random-seed variation.
4.7. Robustness to Missing and Noisy Inputs
Both sets of experiments underline that (1) RevIN and the temporal encoding layer are critical for coping with distributional drift and capturing multi-scale temporal cues; (2) the two-stage LP-FT schedule yields the best trade-off between training efficiency and predictive performance. Consequently, our final PLB-GPT configuration retains all architectural components and adopts the LP-FT fine-tuning pipeline for subsequent experiments.
To assess the robustness of PLB-GPT under realistic field conditions, we conducted controlled perturbation experiments that simulated missing and noisy meteorological inputs, which commonly arise from sensor malfunction, communication interruption, or measurement noise. Missingness was introduced by randomly masking a proportion of input values across time steps or channels and applying the same imputation strategy used during preprocessing, while noise was simulated via additive Gaussian perturbations scaled by each feature’s standard deviation.
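The perturbation protocol can be sketched as below. The mean-imputation fill is a simple stand-in for the preprocessing imputation strategy mentioned above, and all function names are ours.

```python
import random

def mask_and_impute(series, missing_rate, rng):
    """Randomly mask values, then fill them with the observed mean."""
    kept = list(series)
    observed, masked_idx = [], []
    for i, v in enumerate(series):
        if rng.random() < missing_rate:
            masked_idx.append(i)
        else:
            observed.append(v)
    fill = sum(observed) / len(observed) if observed else 0.0
    for i in masked_idx:
        kept[i] = fill
    return kept

def add_gaussian_noise(series, sigma_scale, rng):
    """Add Gaussian noise scaled by the feature's standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = (sum((v - mean) ** 2 for v in series) / n) ** 0.5
    return [v + rng.gauss(0.0, sigma_scale * std) for v in series]
```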
Table 10 reports performance across 1-day, 3-day, 5-day, and 7-day forecasting horizons under varying perturbation levels. Across all horizons, model performance degrades smoothly and monotonically as perturbation severity increases, without abrupt collapse, indicating stable behavior under input uncertainty. Among the evaluated settings, the 5-day horizon exhibits the strongest robustness, maintaining F1-scores above 0.81 even under 20% missingness or a noise level of …, consistent with its favorable trade-off between temporal context and forecasting uncertainty. Metric-wise, Precision remains relatively stable under moderate perturbations, whereas Recall shows a larger but gradual decline, reflecting the higher sensitivity of outbreak detection to incomplete or noisy inputs. Overall, these results demonstrate that PLB-GPT degrades gracefully rather than catastrophically when confronted with imperfect data, supporting its practical reliability in real-world agricultural disease monitoring systems.