Cereal and Rapeseed Yield Forecast in Poland at Regional Level Using Machine Learning and Classical Statistical Models

Okupska, Edyta; Gozdowski, Dariusz; Pudełko, Rafał; Wójcik-Gront, Elżbieta

doi:10.3390/agriculture15090984

Open Access Article

¹ Seed and Agricultural Farm, “Bovinas” Ltd., Chodow 17, Chodow, 62-652 Poznań, Poland

² Department of Biometry, Institute of Agriculture, Warsaw University of Life Sciences, Nowoursynowska 159, 02-776 Warsaw, Poland

³ Department of Bioeconomy and Systems Analysis, Institute of Soil Science and Plant Cultivation – State Research Institute (IUNG-PIB), Czartoryskich 8, 24-100 Pulawy, Poland

* Author to whom correspondence should be addressed.

Agriculture 2025, 15(9), 984; https://doi.org/10.3390/agriculture15090984

Submission received: 12 February 2025 / Revised: 17 April 2025 / Accepted: 30 April 2025 / Published: 1 May 2025

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

This study performed in-season yield prediction, about 2–3 months before the harvest, for cereals and rapeseed at the province level in Poland for 2009–2024. Various models were employed, including machine learning algorithms and multiple linear regression. The satellite-derived normalized difference vegetation index (NDVI) and climatic water balance (CWB), calculated using meteorological data, were treated as predictors of crop yield. The accuracy of the models was compared to identify the optimal approach. The strongest correlation coefficients with crop yield were observed for the NDVI at the beginning of March, ranging from 0.454 for rapeseed to 0.503 for rye. Depending on the crop, the highest R² values were observed for different prediction models, ranging from 0.654 for rapeseed based on the random forest model to 0.777 for basic cereals based on linear regression. The random forest model was best for rapeseed yield, while for cereal, the best prediction was observed for multiple linear regression or neural network models. For the studied crops, all models had mean absolute errors and root mean squared errors not exceeding 6 dt/ha, which is relatively small because it is under 20% of the mean yield. For the best models, in most cases, relative errors were not higher than 10% of the mean yield. The results proved that linear regression and machine learning models are characterized by similar predictions, likely due to the relatively small sample size (256 observations).

Keywords: grain yield; satellite data; remote sensing; random forest; neural networks

1. Introduction

Yield forecasting is essential at various levels, from individual fields to the global scale [1]. On a local scale, it is primarily significant for farmers, while on a regional and global scale, it is crucial for society as a whole because it supports food security. Crop forecasting on a regional and global scale is particularly important for staple foods such as cereals, oilseed crops, and other crops that are subject to international trade and form the basis of food production and consumption in most countries worldwide. Regional crop forecasting uses various types of input data, and meteorological data such as air temperature and precipitation are frequently utilized [2]. These factors affect soil moisture, one of the most critical factors influencing crop yields. Crop forecasting, both at the local and regional levels, often also considers the current state of crops, which can be assessed using remote sensing satellite data, such as vegetation indices, with or without meteorological variables [3,4,5,6]. Various low-resolution (pixel size 250–1000 m) multispectral satellite sensors, including NOAA AVHRR and Terra and Aqua MODIS, are data sources used for regional and global yield forecasting [7,8,9]. These satellite data are used for the calculation of vegetation indices, which are the input variables for services like ASAP-Anomaly Hotspots of Agricultural Production (https://agricultural-production-hotspots.ec.europa.eu/, accessed on 16 January 2025), GLAM/GIMMS Global Agricultural Monitoring (https://glam1.gsfc.nasa.gov/, accessed on 16 January 2025), and MARS Monitoring Agricultural ResourceS (https://joint-research-centre.ec.europa.eu/monitoring-agricultural-resources-mars_en, accessed on 16 January 2025). For local yield forecasting at the individual field or farm level, moderate resolution (pixel size 10–30 m) imagery is used, such as Sentinel-2 MSI and Landsat 8 OLI [9,10].
In satellite-based yield forecasting, various vegetation indices are used as predictors: the normalized difference vegetation index (NDVI), enhanced vegetation index (EVI), soil adjusted vegetation index (SAVI), vegetation condition index (VCI), and others [11]. The most common vegetation index in studies on yield forecasting is the NDVI, which is often used as a predictor of the grain yield of cereals [1]. It is used in univariate analyses, where the NDVI is taken for a single date (e.g., maximal NDVI in the growing season), and in multivariate analyses, where the NDVI for multiple dates is used alone or together with other variables (other vegetation indices or variables that characterize environmental conditions).

Various statistical models are used for yield forecasting. Regression with one or more predictors is the most common method in such studies [1]. Various types of regression, including linear and nonlinear models, are applied. Other standard models used for yield prediction are process-based models, usually used for local studies. The most common are DSSAT (Decision Support System for Agrotechnology Transfer) and WOFOST (World Food Studies Simulation Model) [12]. Recently, for yield prediction at different spatial scales, the most commonly applied models are based on machine learning methods [13]. Methods such as neural networks, random forest, image segmentation, Bayesian models, decision and regression trees, and support vector machines are commonly used for yield forecasting [14,15]. With its data-driven approach, machine learning can capture nonlinear relationships between predictors and yield, which allows us to obtain a higher accuracy of early yield prediction compared to the linear trend model [16,17,18]. Compared to classical statistical methods such as regression, the advantage of machine learning models is usually in studies where the sample size is large and nonlinear relationships are observed [19]. However, assessing the potential benefits of using machine learning methods over simpler linear regression models needs to be empirically verified due to problems related to overfitting for small datasets [20,21]. A study by Meroni et al. [18] on cereal yield prediction for provinces of Algeria proved the superiority of machine learning over regression despite the relatively small sample size (ranging from 340 to 408). However, the differences in accuracy between machine learning and benchmark models were often very small. 
In that study [18], five standard machine learning algorithms, least absolute shrinkage and selection operator (LASSO), random forest (RF), multi-layer perceptron (MLP), support vector regression with linear and radial basis function kernels (SVR lin and SVR rbf), and gradient boosting (GBR), were applied. Depending on crop species and prediction date, different algorithms were characterized by the highest accuracy of the yield prediction. Another study conducted on a small sample size (n = 80), in which the performance of different machine learning models (DNNs, i.e., deep neural networks, SVR, and RF) was compared, found the best yield forecast accuracy for DNNs [22].

In the review study of Clark et al. [5] on the yield forecasting of wheat, barley, and canola, based on multiple studies from 2015 to 2021, machine learning models outperformed process-based models. Still, they did not surpass statistical ones, based on regression models.

Models that used a combination of remotely sensed and agrometeorological data had a similar accuracy, with a mean R² of about 0.70–0.75, to those that used only remotely sensed data [5]. Models in which only agrometeorological data were used as predictors usually had a much lower R² (mean of about 0.55). However, the accuracy of yield predictions varies widely. It depends on the crop, environmental conditions, other factors, and the models (algorithms and predictors) used for yield forecasting. For that reason, evaluating these factors and comparing different models under specific regional conditions is important.

This study aims to predict the in-season yield of cereals and rapeseed at the province level in Poland for 2009–2024 using various models, including machine learning models and linear regression. The satellite-derived NDVI and climatic water balance (CWB) based on meteorological data were used as predictors. The accuracy of the studied models was compared to select the optimal model. The novelty of this study is the development of models for the early prediction (about 2–3 months before the harvest) of crop yield at the province level in Poland using publicly available, easy-to-obtain data as yield predictors. For that reason, other factors such as soil quality, planted varieties, and agricultural management practices were intentionally not included, because they are difficult to collect for vast areas during crop vegetation before the harvest.

2. Materials and Methods

2.1. Study Area and Input Data

This study was performed for Poland (Central Eastern Europe) at the province level (16 provinces) from 2009 to 2024. For this period, data for the grain yield of wheat, rye, triticale, barley, and basic cereal species (rye, wheat, oat, barley, and triticale) together with the yield of rapeseed were retrieved from the Central Statistical Office of Poland website (https://bdl.stat.gov.pl, accessed on 16 January 2025). These crops were selected because they are mainly cultivated as winter crops (sowing at the end of summer or early autumn) and are the most important crops in Poland in terms of cultivation area. The typical sowing date of these crops is from the end of August (rapeseed) to the beginning of October (wheat and triticale). In Figure 1, the grain yield of basic cereals is presented for two extreme years in terms of crop yields, i.e., 2018, which was characterized by the lowest yields, and 2022, when the highest yields were observed. The same figure presents the satellite-derived NDVI for the respective years, averaged over the cropland area of each province.

The following potential predictors were included in this study: climatic water balance (CWB) for August–September of the preceding year (CWB_1), CWB for April–May (the period of intensive vegetation of most crops; CWB_2), and the satellite-derived NDVI for late autumn (October–November) and early spring (March–May) for 8-day periods. These variables were collected for each year separately. CWB characterizes the soil moisture conditions over a given period. It is evaluated using meteorological data (precipitation, temperature) and soil water capacity. In Poland, CWB is evaluated by the Institute of Soil Science and Plant Cultivation – State Research Institute in Puławy (https://susza.iung.pulawy.pl/en/, accessed on 20 January 2025). CWB is calculated as the difference between precipitation and Penman evapotranspiration in a given period [23]. Higher values of CWB usually indicate better soil moisture conditions, especially in spring or summer, when water shortages are observed.
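The CWB definition above (precipitation minus Penman evapotranspiration over the same period) can be illustrated with a minimal sketch. The function name and the daily values are hypothetical and are not taken from the IUNG-PIB monitoring system, which may apply additional processing.

```python
def climatic_water_balance(precip_mm, pet_mm):
    """Return CWB (mm) for a period: total precipitation minus total
    Penman evapotranspiration over the same days."""
    if len(precip_mm) != len(pet_mm):
        raise ValueError("both series must cover the same number of days")
    return sum(precip_mm) - sum(pet_mm)


# Hypothetical daily values (mm) for a five-day window.
precip = [0.0, 4.0, 0.0, 10.0, 1.0]
pet = [3.0, 2.0, 3.5, 2.5, 3.0]

cwb = climatic_water_balance(precip, pet)
print(cwb)  # 1.0 (15.0 mm of rain minus 14.0 mm of evapotranspiration)
```

A negative result would indicate that evapotranspiration exceeded precipitation in the period, i.e., a water deficit.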

The NDVI was calculated from the MOD09Q1 version 6.1 product, which is provided for 8-day periods at 250 m spatial resolution as a gridded Level 3 product from the MODIS (Moderate Resolution Imaging Spectroradiometer) sensor on the Terra satellite (https://lpdaac.usgs.gov/products/mod09q1v061/, accessed on 20 January 2025). Obtaining frequent and reliable (cloud-free) satellite images requires a short revisit time. In the case of MODIS Terra, the revisit time for Poland is 1–2 days, which makes it possible to obtain imagery of sufficient quality and to calculate the NDVI for the entire area of Poland. The following periods were included in the analyses: 10/16–10/23, 10/24–10/31, 11/01–11/08, 11/09–11/16, 11/17–11/24, and 11/25–12/02 for autumn (early growth stage of winter crops) and 03/06–03/13, 03/14–03/21, 03/22–03/29, 03/30–04/06, 04/07–04/14, 04/15–04/22, 04/23–04/30, 05/01–05/08, 05/09–05/16, 05/17–05/24, and 05/25–06/01 for spring. The satellite data were downloaded from the Google Earth Engine service (https://developers.google.com/earth-engine/datasets/catalog/MODIS_061_MOD09Q1, accessed on 20 January 2025). The NDVI raster layers were clipped to the cropland mask (GFSAD30CE-Global Food Security-support Analysis Data Cropland Extent 30 m V001 for nominal year 2015, https://www.usgs.gov/apps/croplands/gfsadce30info), which was obtained from the Earthdata website (https://search.earthdata.nasa.gov/, accessed on 20 January 2025). The calculation of the mean NDVI for each period and each province was performed using Python 3.11 scripts in the Google Colab environment (https://colab.research.google.com/, accessed on 20 January 2025). Each 8-day NDVI period was labeled by the midpoint of that period, e.g., for the period 03/06–03/13, the NDVI was presented as NDVI 03-09.
Geographical data visualization was performed in QGIS 3.40 software (https://www.qgis.org/, accessed on 20 January 2025).
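The NDVI processing described above can be sketched as follows. This is an illustration only and not the authors' script: it uses the standard definition NDVI = (NIR − Red) / (NIR + Red), where in MOD09Q1 the red and near-infrared reflectances are bands 1 and 2. The small arrays stand in for one province's pixels, and the labeling rule (start date plus three days) is an assumption inferred from the single example given in the text.

```python
from datetime import date, timedelta

import numpy as np


def ndvi(red, nir):
    """NDVI = (NIR - Red) / (NIR + Red); NaN where the denominator is zero."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out


def mean_ndvi_over_cropland(red, nir, cropland_mask):
    """Mean NDVI over pixels flagged as cropland, ignoring NaN pixels."""
    values = ndvi(red, nir)[np.asarray(cropland_mask, dtype=bool)]
    return float(np.nanmean(values))


def period_label(start):
    """Label an 8-day period by its midpoint (assumed: start date + 3 days)."""
    return (start + timedelta(days=3)).strftime("NDVI %m-%d")


# Hypothetical 2 x 2 reflectance tiles and cropland mask for one province.
red = [[0.1, 0.2], [0.0, 0.3]]
nir = [[0.3, 0.6], [0.0, 0.3]]
mask = [[1, 1], [1, 0]]

province_mean = mean_ndvi_over_cropland(red, nir, mask)
label = period_label(date(2024, 3, 6))
print(label, round(province_mean, 3))
```

In this toy example the pixel with zero reflectance in both bands yields NaN and is skipped, and the non-cropland pixel is excluded by the mask before averaging.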

2.2. Statistical Data Analysis

The grain yield prediction was based on data from 16 provinces of Poland over 16 years. The total number of observations for each crop was equal to 256.

The first model of yield prediction was based on multiple linear regression, where the following function was applied:

y = b₀ + b₁x₁ + b₂x₂ + … + b_kx_k + e

where y is the dependent variable; b₀ is the intercept; b₁, b₂, …, b_k are the regression coefficients for the predictors x₁, x₂, …, x_k; and e is the random error term.

The analysis of regression was performed using two methods. The first was based on the total dataset, and the second was performed on a randomly selected 80% of the dataset. The second type of analysis was performed 100 times, and statistics of the fitting accuracy of the models were averaged.
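The repeated 80/20 procedure described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the data are synthetic stand-ins (256 rows, three numeric predictors), R² is assumed to be computed on the held-out 20%, and the helper names are hypothetical. The source does not specify the software used for this step.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares estimates [b0, b1, ..., bk] for y = b0 + sum(b_i * x_i) + e."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]


def holdout_r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def repeated_holdout_r2(X, y, n_repeats=100, train_frac=0.8, seed=0):
    """Average test-set R2 over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_frac * len(y)))
    scores = []
    for _ in range(n_repeats):
        idx = rng.permutation(len(y))
        train, test = idx[:n_train], idx[n_train:]
        coef = fit_ols(X[train], y[train])
        scores.append(holdout_r2(y[test], predict_ols(coef, X[test])))
    return float(np.mean(scores))


# Synthetic stand-in for 16 provinces x 16 years = 256 observations.
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 3))
y = 40.0 + X @ np.array([2.0, -1.0, 3.0]) + rng.normal(scale=1.0, size=256)

mean_r2 = repeated_holdout_r2(X, y)
print(round(mean_r2, 3))
```

Averaging over many random splits reduces the dependence of the reported accuracy on any single partition, which matters for a dataset of this size.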

Two machine learning models were applied to the same predictors and dependent variables as the multiple linear regression. The first model was based on random forest, with hyperparameters optimized through RandomizedSearchCV and GridSearchCV for improved performance and accuracy. The dataset was first split into training (80%) and testing (20%) sets, and categorical variables were encoded.
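A random forest workflow of this kind can be sketched with scikit-learn as below. The search space, the number of search iterations, the one-hot encoding of the province, and the synthetic data are all assumptions made for illustration; the source does not report the grids or encoding that were actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data: 256 rows, 3 numeric predictors, 16 provinces.
rng = np.random.default_rng(0)
n = 256
province = rng.integers(0, 16, size=n)
numeric = rng.normal(size=(n, 3))
province_effect = rng.normal(scale=3.0, size=16)
y = 40.0 + 2.0 * numeric[:, 0] + province_effect[province] + rng.normal(size=n)

# Encode the categorical province as 16 one-hot columns (assumed encoding).
X = np.column_stack([numeric, np.eye(16)[province]])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical search space; randomized search samples n_iter combinations.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

# R2 of the refitted best model on the held-out 20%.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

A common pattern, consistent with the two tools named in the text, is to run a randomized search first and then a narrower GridSearchCV around the best parameters it finds.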

The second model was based on a deep neural network (DNN). The model consists of several fully connected layers (Dense), which are characteristic of DNNs. It also uses activation functions (ReLU) and the Adam optimizer, standard in deep learning models. The network consisted of three layers: the first with 64 neurons, the second with 32 neurons, and the final output layer with 1 neuron. Additionally, dropout regularization was used to prevent overfitting. For each dependent variable, 100 training iterations were performed, and the best-performing model was selected based on the highest R² score.

These three methods were applied because they are the most common in such studies [1,5].

For all models, performance was evaluated using metrics like mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²).
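For reference, the three metrics can be written out directly. The sketch below uses the standard definitions with hypothetical observed and predicted yields in dt/ha; the function names are illustrative.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical yields (dt/ha).
observed = [40.0, 45.0, 50.0, 55.0]
predicted = [42.0, 44.0, 49.0, 57.0]

print(mae(observed, predicted))                   # 1.5
print(round(rmse(observed, predicted), 3))        # 1.581
print(round(r_squared(observed, predicted), 3))   # 0.92
```

MAE and RMSE are in the units of the yield (dt/ha), so they can be related directly to the mean yield, whereas R² is unitless.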

3. Results

3.1. Basic Statistics of Input Data

The predictors used for the analyses (NDVI and CWB for two periods) were characterized by higher variability across years within the same province than across provinces within the same year (Table 1 and Table 2). This means that the variability of the predictors is higher across the provinces, and because of that, in further analyses, the province is treated as a categorical variable. The difference between the provinces is due to different environmental conditions, including terrain relief, soil types, amount of rainfall, length of the growing season, and other factors. The variability of the dependent variables, i.e., crop yields, differed according to the crop species. The highest relative variability across the provinces, measured as the ratio of standard deviation to the mean, was observed for wheat. In contrast, the lowest relative variability across provinces was observed for rapeseed yield. The variability of crop yields across years, within each province, was similar for all the studied crops and usually slightly smaller than the variability across provinces within the years.

3.2. Correlation Coefficients Between Predictors and Crop Yields

Between predictors (CWB_1, CWB_2, and NDVI for each period) and crop yields, correlation coefficients were calculated separately for each province, and mean correlations were calculated (Table 3). Most of the correlations were positive, but only some were statistically significant. The strongest correlations were observed between crop yield and the NDVI for the beginning of March (9th of March); the correlation coefficients ranged from 0.454 for rapeseed to 0.503 for rye. Correlations were stronger for cereals in comparison to rapeseed. This indicates that the NDVI can better predict grain yield for cereals than for rapeseed. The mean correlations between crop yields and CWB (climatic water balance for spring) were weak. However, the CWB and crop yield correlations were significant for some provinces (results presented in Supplementary Materials in Tables S1–S16). Significant positive correlations between CWB_2 and crop yield of cereals were observed for Lubusz province (r in the range from 0.54 to 0.71) and West Pomeranian province for wheat (r = 0.64) and basic cereals (r = 0.54). This indicates that in these two provinces, a water shortage in early spring can cause a significant decrease in crop yield. A significant but negative correlation was observed between CWB_1 and rapeseed yield in the Lesser Poland province (r = −0.51), indicating that excess precipitation in early spring in this region can cause a decrease in rapeseed yield.
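The procedure of computing a correlation within each province and then averaging across provinces can be sketched as below. The function name and the two-province toy data are hypothetical; Pearson correlation is assumed, since the source does not name the coefficient.

```python
import numpy as np


def mean_within_province_correlation(predictor, crop_yield, province):
    """Pearson r between predictor and yield within each province,
    plus the mean of those coefficients across provinces."""
    predictor = np.asarray(predictor, dtype=float)
    crop_yield = np.asarray(crop_yield, dtype=float)
    province = np.asarray(province)
    per_province = {}
    for name in np.unique(province):
        rows = province == name
        r = np.corrcoef(predictor[rows], crop_yield[rows])[0, 1]
        per_province[str(name)] = float(r)
    return per_province, float(np.mean(list(per_province.values())))


# Hypothetical data: four years for each of two provinces.
province = ["A", "A", "A", "A", "B", "B", "B", "B"]
ndvi_march = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
yield_dt_ha = [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 2.0, 4.0]

per_province, mean_r = mean_within_province_correlation(
    ndvi_march, yield_dt_ha, province
)
print(per_province, round(mean_r, 3))
```

Computing the coefficient within each province removes between-province differences in yield level, so the mean reflects the year-to-year association only.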

3.3. Results of Crop Yield Prediction Based on Linear Regression and Machine Learning Models

Four types of analyses were performed for all crops: multiple linear regression using the full dataset, multiple linear regression using training (80% of observations) and test (20% of observations) sets, random forest, and deep neural networks. In all these methods, the predictors were CWB_1, CWB_2, and the satellite-derived NDVI for 8-day periods. The dependent variable was crop yield, expressed in dt/ha. For each method, statistical parameters of prediction accuracy were calculated: coefficient of determination (R²), mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE) (Table 4). The prediction accuracy for various models and all of the studied crops is presented in Figure 2. For comparison with RF and DNN, linear regression was also performed using the training set (80%) and the test set (20%); because this procedure was repeated 100 times and the results were averaged, no single test set can be presented in the charts. For that reason, only the results of the linear regression performed on the total dataset are presented in Figure 2. The highest R² for most crops was observed for linear regression performed on the full dataset. It ranged from 0.421 for rapeseed to 0.777 for basic cereals. Among the individual cereal species, the highest R² (0.754) was observed for wheat. Coefficients of regression for yield prediction based on multiple linear regression with a stepwise selection of the predictors are presented in Table 5, which allows crop yields to be predicted using CWB_1, CWB_2, and NDVI for 8-day periods for each province.

Multiple regression based on training and test sets (80/20%) was calculated to ensure the comparability of the results with the random forest and neural network models, which were also fitted using training and test sets (80/20%). The highest average R² was observed for neural networks (average R² = 0.635) and the lowest for linear regression based on the training and test sets (average R² = 0.557). Random forest was characterized by medium accuracy (average R² = 0.568). The highest R² for cereals, depending on crop species, was observed for linear regression or neural networks, while for rapeseed, the highest R² was observed for random forest. The value of R² is related to the other parameters of model accuracy, i.e., the higher R² is, the lower MAE and RMSE are. For example, the lowest MAE for wheat was observed for multiple linear regression based on the full dataset, and it was equal to 2.94 dt/ha. The mean grain yield of wheat was about 45 dt/ha, which means that the mean absolute error was about 6.5% of the mean yield, which is a good prediction. In the case of random forest, the MAE for wheat was equal to about 4.64 dt/ha, and for neural networks, it was equal to about 3.42 dt/ha. These values are higher but still relatively low in relation to the mean grain yield because they are less than 11% of the mean yield. In the case of the other studied crops, all models were characterized by MAE and RMSE not exceeding 6 dt/ha, which means that these errors do not exceed, in most cases, 20% of the mean yield. The prediction results are very promising and indicate that, depending on the crop, either classical models or machine learning models (random forest and deep neural networks) give the better prediction. The similar performance of the two groups of models is probably due to the relatively small sample size (256 observations), which is insufficient for developing highly optimized machine learning models.
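The relative errors quoted above follow from dividing each MAE by the mean yield. A short check using the wheat values from the text (mean yield of about 45 dt/ha) is given below; the function and variable names are illustrative.

```python
def relative_error_percent(error_dt_ha, mean_yield_dt_ha):
    """Express an absolute error (dt/ha) as a percentage of the mean yield."""
    return 100.0 * error_dt_ha / mean_yield_dt_ha


mean_wheat_yield = 45.0  # dt/ha, approximate mean quoted in the text

# Wheat MAE values (dt/ha) quoted in the text.
wheat_mae = {
    "linear regression (full dataset)": 2.94,
    "random forest": 4.64,
    "neural network": 3.42,
}

relative = {
    model: relative_error_percent(value, mean_wheat_yield)
    for model, value in wheat_mae.items()
}
for model, pct in relative.items():
    print(f"{model}: {pct:.1f}% of mean yield")
# linear regression (full dataset): 6.5% of mean yield
# random forest: 10.3% of mean yield
# neural network: 7.6% of mean yield
```

All three values are below 11%, in line with the statement in the text.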

4. Discussion

The results obtained in this study are highly promising because data for the predictors used in the analyses can be collected 2–3 months before the harvest, which in Poland usually takes place from mid-July to mid-August. Such a prediction allows for crop yield forecasting with an MAE of about 3 dt/ha and an RMSE of about 3–4 dt/ha. The prediction error is usually less than 10% of the mean crop yield. The results showed that, despite climate change, under the environmental conditions of central Europe the yield of winter cereals and rapeseed depends strongly on the conditions after sowing and in early spring. The weather conditions in advanced growth stages do not strongly impact the yield of winter crops. This can be explained by the acceleration of vegetation and the shorter duration of individual development phases at increasingly higher temperatures [24,25]. Our study was performed at the province level, where areas are quite large, and extreme events (e.g., hail showers or extreme drought) do not affect such large areas as a whole. The effects of extreme events should be included in yield forecasting in countries where such events are very frequent [26]. Central Europe, including Poland, is a region where extreme weather conditions, such as long-term drought, rarely occur [27]. This makes it easier to forecast yields because yields in different years in the same area are quite similar, especially if a fairly large area is considered, which eliminates extreme local variability. In general, lower spatial resolution eliminates local errors and decreases the bias of the yield forecast [28,29].

In our study, multiple linear regression gave yield predictions similar to those of machine learning models such as random forest and deep neural networks for most crops. This was probably because of the relatively small dataset (N = 256). Our results are partly consistent with the systematic review of Schauberger et al. [1], in which statistical models were characterized by better forecast performance (higher R²) than machine learning models. In the study of Ansarifar et al. [30], various statistical and machine learning models were applied for the yield prediction of maize and soybean based on a set of predictors that characterize various meteorological and soil conditions. In that study, the models were compared using the RMSE for crop yield. Machine learning models such as random forest and neural networks had a smaller RMSE than linear regression. However, other regression models, e.g., stepwise or ridge regression, had RMSE values similar to those of the ML models.

In the study of Lee et al. [31], various statistical and machine learning models were evaluated for the prediction of the canopy nitrogen weight of corn, where the predictors were UAV-derived vegetation indices. The highest accuracy, expressed as R², was obtained for linear regression, and it was higher than for machine learning models such as random forest and support vector machine. The lower accuracy of the ML models compared to linear regression was probably due to the small sample size (n = 63 for calibration and n = 28 for validation).

Another study where linear regression models are compared with machine learning models (random forest) for crop yield prediction is the study of Killen et al. [32]. In this study, UAV-derived indices were included as a predictor of the grain yield of maize. Most of the results in this study proved that yield prediction using random forest was better than prediction based on linear regression. However, linear regression had better spatial generalizability than random forest, and random forest was likely overfitting the data.

In the study of Schwalbert et al. [33], various models for the yield prediction of soybean were evaluated, where predictors were satellite-derived variables, including vegetation indices. The lowest mean absolute error was observed for neural network models. The exception was the earliest growth stage, during which the best prediction was observed for multivariate linear regression.

One of the risks of using machine learning in crop yield prediction is overfitting the models. Overfitting occurs when a model fits the training data too well, including its noise and outliers, resulting in poor performance on new data. Because of that, machine learning algorithms may require larger amounts of data to avoid overfitting [20]. The development of optimal prediction models should avoid both underfitting (high bias and low variance) and overfitting (low bias and high variance) to obtain good model prediction capability [21].

Most studies in the systematic review of Schauberger et al. [1] lack an out-of-sample performance assessment, where a forecasting method is trained on a subset of data and applied to a test set without further calibration. Because of that, coefficients of determination (R²) are not directly comparable between different studies. Independent test sets are more common for machine learning than for classical statistical methods. To maintain comparability in our study, all three models (linear regression, random forest, and deep neural networks) were trained and tested on the same sets (80/20%).

Another difficulty in comparing the results from different studies is the spatial resolution at which the yield prediction was performed. In studies performed at the individual field or farm level, R² for the model is usually lower in comparison to studies performed at the regional or country level [34,35,36,37].

Furthermore, the studies on yield forecasts are based on various types of data. One of the important issues is when the forecasting is performed. Usually, better performance of the models is obtained in the late season, i.e., not later than 4 weeks before harvest, than in the mid-season, i.e., 8 or more weeks before the harvest [1,37,38,39,40]. Quite good forecasts are obtained up to about 4 months before harvest, while a longer time to harvest usually results in unreliable forecasts, i.e., R² substantially below 0.5.

An important issue is the selection of predictors used in forecasting. In our study, only three predictors were used, characterizing weather conditions from sowing (CWB) and the most recent state of crops (NDVI). This is undoubtedly an advantage, as obtaining these input data is easy and does not require additional actions, as they are available based on free weather data and satellite images. Yield prediction accuracy can vary depending on the input data. In the systematic review of Schauberger et al. [1], the highest R² was obtained where ground measurements and remote sensing data were used as predictors. Some studies proved high model accuracy using only remote sensing data or together with meteorological variables [5,41,42].

Since 2020, machine learning models have become the most widely used method for yield forecasting [5,40]. Among the many machine learning methods, artificial neural networks and random forests are most commonly used for yield forecasting. These methods often provide better results than classical statistical models, such as linear regression. However, this is not always the case, as classical statistical models can provide better predictions, especially for smaller datasets, as in our research.

5. Conclusions

The results of this study are positive because they showed that yield forecasting is possible about 2–3 months before the harvest using publicly available, easy-to-obtain input data, i.e., CWB and the satellite-derived NDVI for selected dates in early growth stages of the crops. Most correlations between predictors and crop yield were positive, with the strongest relationships observed for the NDVI at the beginning of March. Multiple linear regression and deep neural networks produced the highest R² for cereal crops. Random forest performed best only for rapeseed. All models showed relatively low errors, indicating good predictive accuracy. These findings suggest that, depending on the crop, different types of prediction models achieve the highest accuracy. The selection of an optimal model should take into account the type of crop and the area for which the forecast is performed. Nevertheless, some limitations should be considered, including the relatively small sample size, the omission of agronomic factors such as soil quality and management practices, and the potentially limited transferability of the models to other regions with different climatic conditions.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agriculture15090984/s1. Tables S1–S16 each report the correlation coefficients between crop yields and the predictors included in this study (CWB_1, CWB_2, and NDVI for 8-day periods presented as the midpoints, e.g., for the period 03/06–03/13, NDVI was presented as NDVI 03-09) for one province: Table S1: Lower Silesian; Table S2: Kuyavian–Pomeranian; Table S3: Lublin; Table S4: Lubusz; Table S5: Łódź; Table S6: Lesser Poland; Table S7: Masovian; Table S8: Opole; Table S9: Subcarpathian; Table S10: Podlaskie; Table S11: Pomeranian; Table S12: Silesian; Table S13: Holy Cross; Table S14: Warmian–Masurian; Table S15: Greater Poland; Table S16: West Pomeranian.

Author Contributions

Conceptualization, E.W.-G. and E.O.; methodology, E.W.-G. and E.O.; software, E.W.-G.; validation, E.W.-G. and E.O.; formal analysis, E.W.-G. and E.O.; investigation, E.O. and E.W.-G.; resources, E.O., R.P. and E.W.-G.; data curation, E.O., R.P., D.G. and E.W.-G.; writing—original draft preparation, E.O. and E.W.-G.; writing—review and editing, E.O., R.P. and E.W.-G.; visualization, E.O. and E.W.-G.; supervision, E.W.-G.; project administration, E.W.-G.; funding acquisition, E.W.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Author Edyta Okupska was employed by the company Seed and Agricultural Farm, “Bovinas” Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CWB	climatic water balance
DNNs	deep neural networks
LASSO	least absolute shrinkage and selection operator
MAE	mean absolute error
MLP	multi-layer perceptron
MODIS	Moderate Resolution Imaging Spectroradiometer
MSE	mean squared error
NDVI	normalized difference vegetation index
NNs	neural networks
RF	random forest
RMSE	root mean squared error
SVR	support vector regression


Figure 1. Grain yield (in dt/ha) of basic cereal species in provinces of Poland in 2018 (the year with the lowest grain yield) and 2022 (the year with the highest grain yield) and mean NDVI at the beginning of March (03/06–03/13) for cropland for each province for the same years. Maps on the right side present the location of Poland in Europe and its geographical coordinates.

Figure 2. Observed and predicted crop yield (dt/ha) using various models, shown separately for each crop. The linear regression results shown here are based on the total dataset, so there are no red dots for test sets.

Table 1. Means for the variables used for the analyses. The statistics were calculated separately for each year across provinces. CWB is presented in mm, crop yield in dt/ha, and values of NDVI are presented for 8-day NDVI periods as the midpoints, e.g., for period 03/06–03/13, NDVI was presented as NDVI 03-09.

Year	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024
CWB_1	0.8	−90.0	83.5	−80.7	−48.2	−23.2	−4.7	−143.5	−83.7	26.9	−119.4	−38.5	−54.7	30.6	−21.6	−68.0
CWB_2	−120.4	54.3	−110.3	−95.7	−16.2	−6.2	−72.4	−77.6	−46.4	−155.0	−53.3	−97.0	−38.4	−72.2	−89.4	−121.5
Basic cereals	35.0	35.5	34.0	35.9	37.0	42.0	38.0	38.4	40.9	33.8	36.6	45.1	43.1	46.4	46.2	44.3
Wheat	40.6	41.9	39.7	40.5	43.1	48.2	44.0	43.3	46.8	39.0	42.1	51.2	48.6	51.9	52.2	50.2
Rye	27.4	27.5	25.6	28.7	29.1	32.2	29.1	30.0	31.5	26.0	28.4	35.2	33.9	35.3	35.5	35.0
Triticale	35.1	33.6	33.0	34.3	35.9	39.9	35.4	36.7	39.0	32.3	35.0	44.0	42.3	44.5	44.6	43.1
Barley	33.5	34.5	32.3	35.0	35.1	39.8	34.5	36.6	38.5	31.0	33.6	42.6	40.3	42.8	42.9	41.8
Rapeseed	29.2	22.7	22.6	25.8	28.2	33.5	27.7	26.5	29.3	25.5	27.0	31.6	32.1	33.6	33.8	32.0
NDVI 10-19	0.54	0.36	0.50	0.57	0.56	0.51	0.55	0.32	0.36	0.55	0.51	0.55	0.52	0.46	0.51	0.59
NDVI 10-27	0.50	0.40	0.46	0.53	0.40	0.52	0.58	0.51	0.40	0.39	0.56	0.56	0.51	0.53	0.53	0.51
NDVI 11-04	0.46	0.51	0.32	0.51	0.41	0.33	0.54	0.43	0.37	0.52	0.54	0.49	0.50	0.52	0.56	0.58
NDVI 11-12	0.51	0.34	0.40	0.50	0.47	0.42	0.28	0.24	0.44	0.37	0.46	0.39	0.48	0.53	0.41	0.50
NDVI 11-20	0.50	0.45	0.28	0.43	0.42	0.34	0.35	0.41	0.45	0.46	0.52	0.53	0.48	0.35	0.40	0.35
NDVI 11-27	0.36	0.42	0.25	0.45	0.44	0.50	0.36	0.47	0.29	0.32	0.49	0.50	0.43	0.40	0.18	0.14
NDVI 03-09	0.18	0.17	0.29	0.33	0.33	0.43	0.40	0.18	0.25	0.28	0.39	0.49	0.37	0.40	0.33	0.50
NDVI 03-17	0.31	0.15	0.31	0.33	0.08	0.44	0.42	0.39	0.31	0.19	0.42	0.51	0.29	0.39	0.43	0.40
NDVI 03-25	0.21	0.35	0.33	0.35	0.05	0.51	0.41	0.41	0.42	0.33	0.46	0.50	0.41	0.39	0.44	0.51
NDVI 04-02	0.43	0.41	0.35	0.33	0.04	0.52	0.29	0.40	0.48	0.39	0.49	0.50	0.46	0.32	0.43	0.49
NDVI 04-10	0.48	0.45	0.38	0.40	0.17	0.51	0.47	0.40	0.55	0.41	0.54	0.52	0.44	0.45	0.49	0.57
NDVI 04-18	0.55	0.52	0.47	0.42	0.41	0.60	0.50	0.54	0.43	0.55	0.54	0.54	0.50	0.41	0.60	0.45
NDVI 04-26	0.58	0.52	0.56	0.52	0.47	0.63	0.57	0.53	0.44	0.60	0.58	0.55	0.55	0.51	0.63	0.64
NDVI 05-04	0.61	0.47	0.61	0.57	0.62	0.64	0.62	0.61	0.54	0.65	0.60	0.52	0.50	0.59	0.64	0.65
NDVI 05-12	0.65	0.49	0.63	0.63	0.67	0.57	0.65	0.64	0.65	0.68	0.54	0.63	0.62	0.63	0.64	0.67
NDVI 05-20	0.67	0.60	0.70	0.71	0.69	0.73	0.66	0.69	0.67	0.72	0.67	0.67	0.60	0.65	0.66	0.67
NDVI 05-28	0.73	0.68	0.73	0.70	0.63	0.71	0.68	0.70	0.71	0.71	0.74	0.66	0.62	0.63	0.69	0.69

Table 2. Means for the variables used for the analyses. The statistics were calculated separately for each province across the years. CWB is presented in mm, crop yield in dt/ha, and values of NDVI are presented for 8-day NDVI periods as the midpoints, e.g., for period 03/06–03/13, NDVI was presented as NDVI 03-09.

Province	Lower Silesian	Kuyavian–Pomeranian	Lublin	Lubusz	Łódź	Lesser Poland	Masovian	Opole	Subcarpathian	Podlaskie	Pomeranian	Silesian	Holy Cross	Warmian–Masurian	Greater Poland	West Pomeranian
CWB_1	−27.8	−63.6	−51.3	−57.9	−56.4	9.6	−51.0	−36.4	−28.5	−41.3	−31.7	−10.2	−34.6	−41.4	−68.8	−43.3
CWB_2	−67.6	−98.3	−71.4	−92.6	−82.3	−2.2	−77.7	−58.1	−31.8	−78.5	−94.1	−32.2	−56.2	−86.3	−97.1	−91.2
Basic cereals	46.6	42.6	40.2	38.2	34.3	37.2	31.1	54.7	35.0	31.1	42.9	38.7	32.3	41.9	41.3	44.2
Wheat	50.4	48.5	46.8	44.1	40.8	40.3	37.5	60.4	38.3	35.6	53.4	45.1	35.4	47.8	48.6	50.5
Rye	34.7	31.2	28.9	31.0	27.4	29.9	25.8	39.8	27.7	26.4	31.2	29.0	25.2	32.9	31.7	37.7
Triticale	41.6	42.4	35.9	40.0	37.5	32.9	33.0	46.9	32.1	33.7	37.2	36.5	31.7	40.6	43.7	42.8
Barley	43.1	37.8	37.3	36.4	33.4	36.5	31.8	49.0	33.9	32.0	36.5	36.6	31.8	35.9	40.8	41.9
Rapeseed	28.8	29.4	28.3	27.9	27.5	30.1	27.9	32.1	27.1	30.1	30.4	28.9	26.0	27.8	29.8	28.9
NDVI 10-19	0.49	0.45	0.47	0.51	0.47	0.55	0.48	0.49	0.53	0.49	0.52	0.50	0.51	0.54	0.45	0.52
NDVI 10-27	0.48	0.50	0.45	0.53	0.50	0.51	0.49	0.46	0.48	0.46	0.52	0.48	0.48	0.52	0.50	0.53
NDVI 11-04	0.47	0.44	0.47	0.49	0.50	0.51	0.48	0.50	0.50	0.42	0.45	0.50	0.47	0.46	0.47	0.46
NDVI 11-12	0.49	0.35	0.39	0.46	0.43	0.46	0.36	0.50	0.46	0.32	0.38	0.47	0.41	0.39	0.44	0.44
NDVI 11-20	0.43	0.41	0.38	0.43	0.46	0.46	0.42	0.45	0.46	0.31	0.41	0.43	0.43	0.39	0.44	0.42
NDVI 11-27	0.42	0.37	0.39	0.47	0.38	0.36	0.39	0.40	0.38	0.30	0.32	0.35	0.39	0.27	0.41	0.38
NDVI 03-09	0.36	0.33	0.30	0.40	0.35	0.29	0.34	0.37	0.30	0.26	0.33	0.33	0.32	0.29	0.37	0.37
NDVI 03-17	0.35	0.33	0.29	0.39	0.33	0.31	0.33	0.38	0.31	0.32	0.34	0.34	0.32	0.33	0.35	0.37
NDVI 03-25	0.40	0.36	0.35	0.42	0.38	0.36	0.38	0.43	0.37	0.36	0.38	0.37	0.36	0.37	0.39	0.39
NDVI 04-02	0.43	0.37	0.36	0.45	0.39	0.36	0.38	0.45	0.37	0.37	0.41	0.38	0.37	0.38	0.42	0.43
NDVI 04-10	0.50	0.43	0.42	0.52	0.44	0.44	0.43	0.50	0.46	0.41	0.45	0.44	0.44	0.43	0.46	0.48
NDVI 04-18	0.55	0.48	0.47	0.57	0.49	0.47	0.50	0.55	0.48	0.48	0.49	0.48	0.47	0.50	0.52	0.53
NDVI 04-26	0.60	0.53	0.52	0.60	0.54	0.56	0.54	0.61	0.57	0.52	0.54	0.54	0.53	0.55	0.56	0.58
NDVI 05-04	0.60	0.56	0.58	0.60	0.59	0.59	0.59	0.61	0.60	0.57	0.58	0.58	0.58	0.61	0.58	0.60
NDVI 05-12	0.63	0.57	0.63	0.62	0.61	0.66	0.61	0.65	0.66	0.62	0.62	0.64	0.63	0.62	0.60	0.62
NDVI 05-20	0.67	0.63	0.67	0.66	0.67	0.71	0.67	0.68	0.70	0.67	0.66	0.69	0.69	0.70	0.64	0.67
NDVI 05-28	0.69	0.64	0.69	0.69	0.68	0.70	0.68	0.70	0.71	0.68	0.70	0.69	0.70	0.71	0.66	0.71

Table 3. Mean correlation coefficients between predictors (CWB_1, CWB_2, and NDVI for 8-day periods presented as the midpoints, e.g., for period 03/06–03/13, NDVI was presented as NDVI 03-09) and the yield of various crops. The coefficients were calculated separately for each province and then averaged.

Crop Yield	Basic Cereals	Wheat	Rye	Triticale	Barley	Rapeseed
CWB_1	−0.044	−0.026	−0.116	−0.035	−0.076	0.030
CWB_2	0.064	0.101	0.041	0.025	0.091	−0.027
NDVI 10-19	0.055	0.050	0.089	0.098	0.042	0.238
NDVI 10-27	0.294	0.273	0.299	0.278	0.270	0.291
NDVI 11-04	0.213	0.186	0.242	0.219	0.175	0.092
NDVI 11-12	0.197	0.172	0.216	0.261	0.237	0.233
NDVI 11-20	−0.062	−0.072	−0.006	−0.020	−0.068	0.000
NDVI 11-27	−0.060	−0.068	−0.051	−0.068	−0.072	−0.039
NDVI 03-09	0.491 *	0.466 *	0.503 *	0.489 *	0.458 *	0.454 *
NDVI 03-17	0.399	0.355	0.397	0.385	0.340	0.370
NDVI 03-25	0.481 *	0.443 *	0.480 *	0.434 *	0.452 *	0.324
NDVI 04-02	0.275	0.258	0.279	0.281	0.278	0.247
NDVI 04-10	0.374	0.368	0.356	0.361	0.347	0.324
NDVI 04-18	0.112	0.129	0.087	0.111	0.060	0.198
NDVI 04-26	0.166	0.166	0.159	0.175	0.127	0.277
NDVI 05-04	0.022	0.032	−0.007	0.023	−0.035	0.213
NDVI 05-12	0.084	0.052	0.103	0.120	0.071	0.109
NDVI 05-20	−0.241	−0.245	−0.248	−0.238	−0.189	−0.104
NDVI 05-28	−0.265	−0.253	−0.278	−0.242	−0.243	−0.215

* Statistically significant correlation in the meta-analysis at the 0.05 significance level, based on the weighted average of Fisher’s Z-transformed correlations. A red cell background indicates a positive correlation and a blue background a negative one; the more intense the color, the stronger the correlation.
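The pooling mentioned in the footnote to Table 3 can be illustrated with a minimal sketch. It assumes the common fixed-effect scheme in which each province's correlation is transformed with Fisher's Z and weighted by n − 3; the article does not state the exact weights, and the per-province correlations below are hypothetical placeholders (the real values are in Tables S1–S16). The function name `pooled_correlation` is also an illustration, not part of the study.

```python
import math
from statistics import NormalDist


def pooled_correlation(rs, ns):
    """Pool correlations with Fisher's Z transform (fixed-effect sketch).

    rs: correlation coefficients, one per province
    ns: number of observations behind each coefficient
    Returns the back-transformed pooled r and a two-sided p-value.
    """
    zs = [math.atanh(r) for r in rs]      # Fisher's Z transform
    ws = [n - 3 for n in ns]              # assumed weights: 1 / Var(z) = n - 3
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    se = 1.0 / math.sqrt(sum(ws))         # standard error of the pooled z
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z_bar / se)))
    return math.tanh(z_bar), p            # back-transform to the r scale


# Hypothetical correlations for four provinces, 16 yearly observations each
rs = [0.55, 0.40, 0.62, 0.35]
ns = [16, 16, 16, 16]
r_pooled, p_value = pooled_correlation(rs, ns)
print(f"pooled r = {r_pooled:.3f}, p = {p_value:.4f}")
```

With equal sample sizes the weights cancel, so the pooled value is simply the back-transformed mean of the Z values; the weights matter only when provinces contribute different numbers of years.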

Table 4. Statistical parameters of prediction accuracy of crop yields (dt/ha) depending on CWB_1, CWB_2, and NDVI for 8-day periods for the studied prediction models.

	Linear Regression (80% Training Set, 20% Test Set)	Linear Regression (All Datasets)	Random Forest	Neural Networks
	Basic cereals
R²	0.712	0.777	0.570	0.711
MAE	3.328	2.942	4.640	3.424
RMSE	4.165	3.688	5.424	4.442
	Wheat
R²	0.658	0.754	0.544	0.723
MAE	3.994	3.449	5.208	4.049
RMSE	5.032	4.318	6.021	4.871
	Rye
R²	0.622	0.732	0.561	0.777
MAE	2.667	2.273	3.279	2.070
RMSE	3.336	2.854	4.006	2.731
	Triticale
R²	0.584	0.692	0.524	0.533
MAE	3.397	2.952	4.158	3.449
RMSE	4.226	3.699	5.078	4.430
	Barley
R²	0.551	0.664	0.554	0.617
MAE	3.385	2.930	3.658	3.654
RMSE	4.262	3.761	4.574	4.487
	Rapeseed
R²	0.214	0.421	0.654	0.450
MAE	3.234	2.804	2.287	2.891
RMSE	4.009	3.496	2.696	3.677
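
The three accuracy measures reported in Table 4 can be computed as in the following sketch. It assumes R² is defined as 1 − SS_res/SS_tot (the article does not say whether this or the squared correlation was used), and the observed and predicted yields are made-up numbers for illustration only.

```python
import math


def regression_metrics(observed, predicted):
    """Return R^2, MAE and RMSE for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,   # assumed definition of R^2
        "mae": sum(abs(e) for e in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Hypothetical yields in dt/ha, not taken from the study
observed = [40.0, 45.0, 50.0, 55.0]
predicted = [42.0, 44.0, 49.0, 57.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because MAE and RMSE are in the same units as the yield (dt/ha), dividing either by the mean observed yield gives the relative error.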

Table 5. Coefficients of regression for crop yield (dependent variable) prediction based on multiple linear regression with stepwise selection of the independent variables, where CWB_1, CWB_2, and NDVI for 8-day periods are predictors (CWB is expressed in mm, crop yield in dt/ha, and values of NDVI are presented for 8-day NDVI periods as the midpoints, e.g., for period 03/06–03/13, NDVI was presented as NDVI 03-09). The provinces are treated as binomial dummy independent variables. The coefficients allow us to predict crop yields for each province *.

Crop Yield (dt/ha) of	Basic Cereals	Wheat	Rye	Triticale	Barley	Rapeseed
Intercept	44.74	49.60	29.97	35.97	37.04	28.35
CWB_1			−0.01
CWB_2
NDVI 10-19	−6.65	−6.32		−5.65	−10.00
NDVI 10-27
NDVI 11-04						−4.77
NDVI 11-12	9.05	8.01	6.83	9.46	10.91	6.28
NDVI 11-20
NDVI 11-27
NDVI 03-09	12.37	14.03	8.31	13.31	12.64	9.38
NDVI 03-17						4.72
NDVI 03-25	17.07	11.87	12.97	9.61	10.67
NDVI 04-02		−9.42
NDVI 04-10		17.53		8.22	5.52	11.21
NDVI 04-18
NDVI 04-26
NDVI 05-04						11.48
NDVI 05-12	14.55		12.84	15.91	7.72
NDVI 05-20	−13.17		−10.45	−14.19
NDVI 05-28	−17.71	−21.14	−14.91	−16.99	−19.89	−24.66
Province (binomial dummy variable)
Lower Silesian
Kuyavian–Pomeranian				5.13
Lublin	−2.74		−1.94
Lubusz	−7.56	−6.84	−2.13		−4.05	−2.29
Łódź	−10.15	−8.36	−4.53		−5.78
Lesser Poland	−5.62	−7.35		−3.82		2.70
Masovian	−12.34	−10.76	−5.45	−3.99	−6.12
Opole	8.66	9.70	6.68	6.95	8.24	2.28
Subcarpathian	−8.03	−9.74	−3.29	−4.88	−3.99
Podlaskie	−10.79	−10.79	−3.70	−1.57	−4.33	3.48
Pomeranian		5.68				2.64
Silesian	−5.53	−3.67	−2.66		−2.36
Holy Cross	−10.71	−12.65	−5.64	−5.05	−6.05
Warmian–Masurian			2.79	5.35
Greater Poland	−4.40			4.28
West Pomeranian			5.88	4.69	3.12

* For example, the regression equation for forecasting wheat yield for Masovian province is as follows: wheat yield = 49.60 − 6.32 × (NDVI 10-19) + 8.01 × (NDVI 11-12) + 14.03 × (NDVI 03-09) + 11.87 × (NDVI 03-25) − 9.42 × (NDVI 04-02) + 17.53 × (NDVI 04-10) − 21.14 × (NDVI 05-28) − 10.76.
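
The wheat equation in the footnote to Table 5 can be evaluated with a short sketch. The coefficients are copied from the Wheat column of Table 5. Treating a blank province cell as a zero effect is an assumption, as are the function name `predict_wheat_yield` and the choice of inputs: the NDVI values used here are the Masovian multi-year means from Table 2, chosen only to illustrate the arithmetic.

```python
# Wheat coefficients from Table 5 (NDVI keys are 8-day period midpoints)
INTERCEPT = 49.60
NDVI_COEF = {
    "10-19": -6.32, "11-12": 8.01, "03-09": 14.03, "03-25": 11.87,
    "04-02": -9.42, "04-10": 17.53, "05-28": -21.14,
}
# Province dummy effects for wheat; blank cells are assumed to mean 0.0
PROVINCE_EFFECT = {
    "Lubusz": -6.84, "Łódź": -8.36, "Lesser Poland": -7.35,
    "Masovian": -10.76, "Opole": 9.70, "Subcarpathian": -9.74,
    "Podlaskie": -10.79, "Pomeranian": 5.68, "Silesian": -3.67,
    "Holy Cross": -12.65,
}


def predict_wheat_yield(ndvi, province):
    """Predicted wheat yield (dt/ha) from NDVI values and a province name."""
    linear_part = sum(coef * ndvi[key] for key, coef in NDVI_COEF.items())
    return INTERCEPT + linear_part + PROVINCE_EFFECT.get(province, 0.0)


# Illustrative inputs: Masovian mean NDVI values from Table 2
masovian_ndvi = {
    "10-19": 0.48, "11-12": 0.36, "03-09": 0.34, "03-25": 0.38,
    "04-02": 0.38, "04-10": 0.43, "05-28": 0.68,
}
wheat_yield = predict_wheat_yield(masovian_ndvi, "Masovian")
print(f"{wheat_yield:.2f} dt/ha")
```

With these inputs the equation gives about 37.55 dt/ha, close to the mean wheat yield of 37.5 dt/ha reported for Masovian province in Table 2.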

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
