Cotton Yield Prediction with Gaussian Distribution Sampling and Variational AutoEncoder

Lan, Yaqi; Wang, Xiudong; Gao, Lei; Chen, Xiaoliang

doi:10.3390/app15189947


by Yaqi Lan ¹, Xiudong Wang ¹,²,³, Lei Gao ¹,* and Xiaoliang Chen ⁴,⁵

¹ Institute of Agricultural Economics and Development, Chinese Academy of Agricultural Sciences, Beijing 100081, China

² Center for Strategic Studies, Chinese Academy of Agricultural Sciences, Beijing 100081, China

³ Chinese Institute of Agricultural Development Strategies, Beijing 100081, China

⁴ School of Computer and Software Engineering, Xihua University, Chengdu 610039, China

⁵ Department of Computer Science and Operations Research, University of Montreal, Montreal, QC H3C3J7, Canada

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(18), 9947; https://doi.org/10.3390/app15189947

Submission received: 2 August 2025 / Revised: 2 September 2025 / Accepted: 3 September 2025 / Published: 11 September 2025

(This article belongs to the Section Agricultural Science and Technology)


Abstract

Accurate cotton yield prediction is crucial for agricultural production management, resource optimization, and market supply–demand balance. However, achieving high-precision cotton yield prediction faces significant challenges, mainly because cotton growth is influenced by complex, nonlinear environmental factors. Traditional machine learning models struggle to fully capture these complex factors, and deep learning models typically rely on large amounts of high-quality data. The high cost of obtaining field measurement data leads to a scarcity of high-quality datasets, further limiting the performance of prediction models. To overcome these challenges, this study proposes a novel cotton yield prediction architecture: Gaussian distribution data augmentation and variational autoencoder (GD-VAE). This architecture offers the following advantages: (1) it calculates the mean and covariance of the existing data and samples new data from the resulting Gaussian distribution, so that the generated samples conform to the original data distribution and effectively expand the training dataset; (2) it uses an end-to-end variational autoencoder (VAE) that automatically learns low-dimensional, compact, and discriminative feature representations of the input data. Specifically, GD-VAE uses a Gaussian distribution to model the original cotton yield data and generates augmented data through sampling. The VAE then learns deep feature representations from these data, which are fed into a regressor for final yield prediction. To evaluate the performance of GD-VAE, we conducted extensive tests under challenging cross-year and cross-district conditions. In the cross-year test in Bahawalnagar, Pakistan, GD-VAE achieved a root mean square error (RMSE) of 58.4 lbs/acre, a mean absolute error (MAE) of 38.19 lbs/acre, and a coefficient of determination ($R^{2}$) of 0.65 between the actual and predicted yields. In the more challenging cross-year and cross-district test in Turkey, GD-VAE achieved an RMSE of 46.46 kg/da, an MAE of 37.74 kg/da, and an $R^{2}$ of 0.14. The results indicate that the GD-VAE architecture significantly improves the accuracy of cotton yield prediction under limited data conditions through effective data augmentation and deep feature learning. This research provides an effective technical means for addressing agricultural prediction problems with limited samples, which has important practical significance for ensuring global food security and sustainable agricultural development (yields are reported in each study area's own unit; for reference, 1 lbs/acre is equivalent to approximately 1.121 kg/ha, and 1 kg/da is equivalent to 10 kg/ha).

Keywords:

cotton; cotton yield prediction; data augmentation; AutoEncoder; deep learning

1. Introduction

Currently, global climate change is continuing to intensify, with extreme weather events occurring more frequently, temperature fluctuations increasing, and changes in precipitation patterns, among other factors, interacting to exert a profound impact on agricultural production systems [1]. As a result, global food security is facing increasingly severe challenges. Among various crops, cotton, as one of the most important economic crops globally, exhibits high sensitivity to climate change [2,3]. Even minor fluctuations in climate factors such as temperature, moisture, and sunlight during its growth cycle can significantly impact its yield and quality. Cotton is not only a key source of natural fiber but also an integral part of the agricultural economy and foreign trade of many countries, particularly in developing nations, where the cotton industry is closely tied to the livelihoods of millions of farmers and the stability of national economies [4]. Therefore, developing a reliable cotton yield prediction model is of critical importance for achieving sustainable management of agricultural resources, formulating adaptive policies, and maintaining the economic resilience of relevant countries and districts.

Although traditional approaches ranging from empirical statistical models to process-based simulations have been routinely used to predict cotton yields, they have difficulty capturing the complex interactions among wide-ranging environmental factors such as soil heterogeneity and climate change [5,6]. To address this limitation, researchers have turned to machine learning (ML), which has demonstrated its utility in advancing our understanding of wide-ranging environmental issues [7] and in applications such as the multi-objective optimization of organic waste gas processing [8] and bank loan prediction [9]. However, traditional ML methods also struggle to capture important nonlinear relationships, which leads to insufficient generalization ability in settings with complex environmental factors.

Unlike statistical machine learning models, deep learning models are flexible and adaptable, which enables them to automatically extract useful features from data [10]. Deep learning models [11] are usually composed of multiple neural network layers with a large number of learnable parameters and nonlinear activation functions, which enable them to capture more complex nonlinear relationships and data characteristics. These characteristics explain why deep learning models usually outperform traditional machine learning models in complex data scenarios [12]. Following their rapid adoption in various fields [13,14], their use has been extended to agriculture, for example in crop target detection [15] and immature fruit sorting [16]. However, wider use of these models is constrained by their large data requirements, which is problematic because cotton yield data are costly to collect and their collection is affected by multiple factors.

In terms of environmental impact, if cotton suffers from long-term sustained high temperatures, wind-related disasters, frost damage, planting factors, and other factors, model prediction will have a large deviation [17,18]. Among the planting factors is the interaction between soil salinization and water stress, which increases the complexity of the data. Different soil water potential thresholds also substantially alter the distribution of salinity and water utilization, with salinity and the time-consuming, costly recording of field data exacerbating difficulties that are encountered in the acquisition of usable data [19]. Collecting enough cotton yield data is also undermined by its long growing cycle, which hinders the timely development of intelligent cotton yield predictions.

To address the challenges of data scarcity and the difficulty of capturing complex nonlinear relationships, this study proposes an architecture based on Gaussian distribution data augmentation and a variational autoencoder (GD-VAE) to achieve more accurate cotton yield predictions. We first propose a Gaussian distribution augmentation method that generates virtual data closely resembling the distribution of the real data: the mean and covariance of the real data are calculated, and new samples are drawn from the Gaussian distribution defined by that mean and covariance. The sampled data are then combined with the existing data. Next, we construct a variational autoencoder-based learner that identifies the patterns embedded in the cotton yield data and their corresponding features. The hypothesis on which this study is based is that GD-VAE is better able than traditional machine learning methodologies to cope with the scarcity of high-quality data in cross-year and cross-district cotton yield prediction. The objective of this contribution is to demonstrate the superiority of GD-VAE for cotton yield prediction over the conventional off-the-shelf techniques that are routinely used by many researchers without due consideration of their limited potential to provide usable information under challenging cross-year and cross-district scenarios.

2. Materials and Methods

2.1. Study Areas

The study areas include the Bahawalnagar district in Pakistan and 81 purposefully selected districts in Turkey (Table 1).

In this paper, cotton yield data from these two study areas were used for yield prediction. Facing increasing environmental challenges, the Bahawalnagar district has become an important testing ground for technological innovations that target cotton production in Pakistan [20]. This district has a typical tropical arid climate, with an average annual temperature of 28 °C and unevenly distributed annual precipitation of less than 200 mm, which makes cotton production under natural hydrothermal conditions extremely stressful and unpredictable. These environmental conditions explain why a methodology that can reliably predict cotton yield in advance is needed to support agriculture. Such predictions are also critical for the Aegean Sea Plain and Haran Plain cotton-producing areas in Turkey, which are located in arid and semi-arid areas that are extremely sensitive to unpredictable, climate change-driven fluctuations in weather patterns. These fluctuations play an important role in determining yield and quality: both below-average and above-average rainfall can reduce yield, while fiber strength increases by ∼2.3% when the average daily temperature increases by ∼1 °C and fiber length shows the reverse trend [21]. In view of these factors, there is an urgent need for a cost-effective and reliable methodology that can be used to predict cotton yield under the prevailing climatic conditions in Pakistan's and Turkey's cotton-producing areas, as well as elsewhere.

2.2. Datasets

This study was mainly conducted on two datasets, namely, the cotton data collected from Bahawalnagar and from 81 districts in Turkey. The Bahawalnagar data contains records from 1999 to 2021, with each year covering the months from May to September. There are up to 45 metrics per record, such as the Normalized Difference Vegetation Index (NDVI), Normalized Difference Build-up Index (NDMI), Plant Senescence Reflectance Index (PSRI), Soil Adjusted Vegetation Index (SAVI), Green Normalized Difference Vegetation Index (GNDVI), etc. As shown in Figure 1, the mean NDVI was stable, ranging between 0.10 and 0.15 from 1999 to 2015, after which it fluctuated periodically at higher levels from 2016. This increased periodicity of NDVI may lead to data instability, which reduces the model's ability to generalize yield predictions across years.

The Turkey data covers 81 districts for the 2019–2023 period, with each year covering the months from June to October. First, as shown in Figure 2, most indicators, such as EVI_filtered, swvl1, t2m_max, tp, soc, and cec, exhibit relatively dispersed distributions and do not show obvious normal or skewed concentration trends. This distribution characteristic indicates significant differences in environmental and soil conditions across districts and years, which increases the difficulty for models to capture consistent patterns and reflects the ecological diversity of Turkey's cotton-growing districts. Second, as shown in Figure 3, the spatial distribution map indicates that key variables, such as YIELD, EVI_filtered, swvl1, and t2m_max, fluctuate significantly across districts and years. For example, t2m_max (maximum temperature) remains consistently high in certain southern districts (such as Antakya and Hassa), whereas swvl1 (surface soil moisture) exhibits substantial variability across different years, potentially linked to uneven precipitation distribution or differences in irrigation conditions. Yield (YIELD) is relatively high and stable in some districts, such as Salihli and Soke, but exhibits significant fluctuations in areas like Viransehir and Derik, indicating notable differences in district management practices or climate adaptation capabilities. This spatial heterogeneity suggests the need to incorporate district-specific characteristics into predictive models or to employ spatially explicit modeling. As further revealed by the correlation analysis of variables in Figure 4, the correlations between environmental factors and yield are generally weak. For example, the correlation coefficient between EVI_filtered and YIELD is only 0.14, that between swvl1 and YIELD is 0.19, and t2m_max and YIELD show a negative correlation (−0.39). Notably, there is a strong negative correlation between t2m_max and swvl1 (−0.57), indicating that high temperatures often accompany soil dryness, potentially leading to a compound effect on cotton water stress. Additionally, soil properties such as SOC (soil organic carbon) and CEC (cation exchange capacity) have weak correlations with YIELD (0.16 and −0.32, respectively), suggesting that the direct impact of soil fertility on yield is not prominent in this dataset and may be masked by other factors (such as irrigation or variety). Finally, Figure 5 visually illustrates the nonlinear, non-monotonic relationships between the various variables and yield. For example, EVI_filtered exhibits a positive correlation with yield within the range of 0.45–0.55, but the relationship weakens or even reverses beyond this range; swvl1 shows a positive correlation with yield between 0.15 and 0.25, with higher or lower values corresponding to lower yields; and yield declines significantly once t2m_max exceeds 308 K, suggesting the presence of high-temperature stress. These patterns indicate that simple linear models struggle to capture the complex interactions between variables and yield, necessitating the introduction of nonlinear modeling methods (such as neural network structures) to better express their underlying mechanisms.

2.3. Methods

Since recording the cotton yield data required by deep learning involves high costs, this study proposes a method that produces new data similar to the original data based on Gaussian distribution data augmentation. To model the cotton yield prediction data, we constructed a variational auto-encoder (VAE) to learn the feature information and intrinsic patterns of the cotton yield data. This was necessary because, as alluded to in preceding sections, cotton yield is often affected by changes in environmental factors, resulting in significant uncertainty in potential yields for different years. We opted to use VAE because it can effectively handle the uncertainty and variability in the data through a variational process that minimizes the differences in data distributions in the latent space. The following subsections (Section 2.3.1 and Section 2.3.2) provide a detailed overview of the methods that were used in this investigation.

2.3.1. Gaussian Distribution Data Augmentation

The data that we used to build our deep learning model was divided into two groups: (1) the training set $D_{train}$ and (2) the test set $D_{test}$. During the training phase, we extracted a mini-batch $(x, y)$ from $D_{train}$ as the input for the model, where $x \in \mathbb{R}^{b \times m \times d}$ is a high-dimensional tensor composed of various indicators and parameters related to cotton production. Here, b represents the size of each extracted mini-batch, m represents the number of months, d represents the dimension of the data, and $y \in \mathbb{R}^{b}$ represents the corresponding yield. After the model training was completed, we input the $D_{test}$ data into the model to predict cotton yield. We partitioned the data by year so that the training and test datasets did not overlap. This partitioning was helpful because, apart from aligning well with the task of predicting future yields, it also allowed us to test the model's generalization capabilities.
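The year-based partition and the mini-batch shapes described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the record layout `(year, features, yield)`, the helper names `split_by_year` and `mini_batches`, the toy feature values, and the batch size are all assumptions, while the 2016 cut-off mirrors the Bahawalnagar split given in Section 2.3.3.

```python
from typing import Iterator, List, Tuple

# Hypothetical record layout: (year, monthly feature vectors of shape m x d, yield).
Record = Tuple[int, List[List[float]], float]

def split_by_year(records: List[Record], last_train_year: int):
    """Assign every record up to last_train_year to train and the rest to test."""
    train = [r for r in records if r[0] <= last_train_year]
    test = [r for r in records if r[0] > last_train_year]
    return train, test

def mini_batches(records: List[Record], batch_size: int) -> Iterator[tuple]:
    """Yield (x, y) pairs; x is nested b x m x d, y has length b."""
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        x = [features for _, features, _ in chunk]
        y = [target for _, _, target in chunk]
        yield x, y

# Toy data: 1999-2021, m = 5 months (May-September), d = 3 placeholder features.
records = [
    (year, [[0.1 * month] * 3 for month in range(5)], 500.0 + year % 7)
    for year in range(1999, 2022)
]
train, test = split_by_year(records, last_train_year=2016)
first_x, first_y = next(mini_batches(train, batch_size=4))
```

Splitting on the year rather than on shuffled rows keeps every test year strictly later than every training year, which is what makes the evaluation a forecast of unseen seasons.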

Neural networks require a large amount of data because of their large number of parameters. However, planting cotton in a field and recording the environmental characteristics and yield is a lengthy process with high labor costs, which results in a scarcity of relevant cotton yield data. When only a small amount of data is available, an intuitive way to build a good deep learning model is to obtain new data through data augmentation. However, cotton yield data is usually sequential numerical data, which makes some traditional data augmentation methods inapplicable [22]. Dropout can achieve a data augmentation effect by randomly deleting a certain proportion of sequence features [23], and it works well in many fields. However, each indicator or parameter in the cotton yield data can be crucial for predicting total yield. Using Dropout for data augmentation may omit key features, such as important environmental factors, resulting in large yield prediction errors. In response, we propose to sample new data from a Gaussian distribution fitted to the original data and to combine the new data with the original data for training the neural network.

Specifically, we first obtain the data $(x_{year}, y_{year}) \in D_{train}$ in units of years from the cotton yield data used for training. Then, the mean of the feature vectors for the i-th year is calculated as the average over the months of that year. The calculation is shown in Equation (1):

$\mu_{year}^{i} = \frac{1}{m^{i}} \sum_{j = 1}^{m^{i}} x_{year}^{j}$

(1)

where $x_{year}^{j}$ represents the feature vector for the j-th month of the i-th year, and $m^{i}$ represents the total number of months in that year. Since the feature vector $x_{year}^{j}$ has multi-dimensional characteristics, we use the covariance matrix to more accurately represent the variance relationships among its elements. The calculation formula for the covariance matrix $\Sigma_{year}^{i}$ of the features in the i-th year is shown in Equation (2):

$\Sigma_{year}^{i} = \frac{1}{m^{i}} \sum_{j = 1}^{m^{i}} (x_{year}^{j} - \mu_{year}^{i}) (x_{year}^{j} - \mu_{year}^{i})^{T} \cdot \alpha$

(2)

where $\alpha = 0.5$ is a hyperparameter used to determine the degree of dispersion when sampling features from the calibrated distribution. It is worth noting that this process is carried out before training the model because the cotton yield data is inherently a sequence of numerical vectors. To simplify the description, the calculated distributions are represented as a set of statistics $S_{year} = \{(\mu_{year}^{1}, \Sigma_{year}^{1}), (\mu_{year}^{2}, \Sigma_{year}^{2}), \dots, (\mu_{year}^{n}, \Sigma_{year}^{n})\}$, where n represents the total number of years in the training set.

For the i-th year in the cotton production data, we generated a set of feature vectors similar to those of the i-th year by sampling from the calculated Gaussian distribution using the corresponding statistics $S_{year}^{i} \in S_{year}$, where $1 \leq i \leq n$. The specific process is shown in Equation (3):

$D_{train}^{i} = \{(x_{new}^{i}, y_{new}^{i} \in y_{year}) \mid x_{new}^{i} \sim \mathcal{N}(\mu_{year}^{i}, \Sigma_{year}^{i}), \forall (\mu_{year}^{i}, \Sigma_{year}^{i}) \in S_{year}^{i}\}$

(3)

The total number of features generated for each year is set as a fixed hyperparameter. The generated features, together with the original data features, constitute the training data for the specific cotton yield prediction model, namely $(x_{year}^{i}, y_{year}^{i}) = Concat((x_{year}^{i}, y_{year}^{i}), (x_{new}^{i}, y_{new}^{i}))$, with $(x_{new}^{i}, y_{new}^{i}) \in D_{train}^{i}$.
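Equations (1)–(3) and the final concatenation can be illustrated with a short NumPy sketch for a single year. It is written under stated assumptions and is not the authors' implementation: the function name `gaussian_augment`, the toy feature values, the number of generated samples `n_new`, and the choice to label every generated vector with that year's yield are all hypothetical.

```python
import numpy as np

def gaussian_augment(x_year, y_year, n_new, alpha=0.5, seed=0):
    """Sample n_new feature vectors from N(mu, Sigma) fitted to one year.

    x_year: array of shape (m, d), one feature vector per month.
    y_year: scalar yield of that year (assumed label for every new sample).
    """
    x_year = np.asarray(x_year, dtype=float)
    m = x_year.shape[0]
    mu = x_year.mean(axis=0)                      # Equation (1)
    centered = x_year - mu
    sigma = alpha * (centered.T @ centered) / m   # Equation (2), scaled by alpha
    rng = np.random.default_rng(seed)
    # Equation (3): draw new feature vectors from the fitted Gaussian.
    x_new = rng.multivariate_normal(mu, sigma, size=n_new)
    y_new = np.full(n_new, y_year, dtype=float)
    # Concatenate the original and generated data to form the training data.
    x_all = np.concatenate([x_year, x_new], axis=0)
    y_all = np.concatenate([np.full(m, y_year, dtype=float), y_new])
    return x_all, y_all, mu, sigma

# Toy year: m = 5 months, d = 3 placeholder indicators.
x_year = np.array([
    [0.10, 0.30, 20.0],
    [0.12, 0.35, 24.0],
    [0.15, 0.40, 29.0],
    [0.14, 0.38, 31.0],
    [0.11, 0.33, 27.0],
])
x_all, y_all, mu, sigma = gaussian_augment(x_year, y_year=620.0, n_new=10)
```

Because α multiplies the covariance, values below 1 shrink the spread of the generated samples around the yearly mean, while the mean itself is unchanged.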

2.3.2. Variational AutoEncoder

This model combines the feature learning capability of the variational autoencoder (VAE) with a regression prediction module. The VAE does not directly output a definite latent variable $z_{vae}$, but rather outputs the distribution of the latent variable, namely the mean $\mu_{vae}$ and the standard deviation $\sigma_{vae}$. Samples are taken from this distribution to obtain $z_{vae}$. This enables the latent space learned by the VAE to be structured, allowing for better learning of the intrinsic characteristics of the data. The overall architecture of the constructed VAE model is shown in Figure 6. It consists of three parts: the encoder $q_{\phi}$, the decoder $p_{\theta}$, and the regressor $r_{\delta}$. Next, we detail the entire VAE-based cotton yield prediction modeling process.

Encoder. Its function is to extract spatiotemporal features and model the latent space. During this process, the encoder learns the intrinsic feature information of the data and then compresses it into a low-dimensional latent space, providing reliable data information for subsequent decoding and prediction. The constructed encoder consists of an encoding Bi-LSTM, a LeakyReLU activation function, and a linear encoder (fully connected layer), and its structure is shown in Figure 6 as ①. Specifically, we first sample a mini-batch $(x, y)$ from the training set and input it into the encoder to obtain its spatiotemporal features. Here, we use the Bi-LSTM to obtain the spatiotemporal information of the cotton production data for each year and each month. The process is shown in Equations (4)–(7):

$\overrightarrow{H_{i}^{x}} = \overrightarrow{LSTM}(\overrightarrow{H_{i - 1}^{x}}, x_{i}), \quad i = 1, 2, \dots, t$

(4)

$\overleftarrow{H_{i}^{x}} = \overleftarrow{LSTM}(\overleftarrow{H_{i + 1}^{x}}, x_{i}), \quad i = t, \dots, 2, 1$

(5)

$H_{i}^{x} = Concat(\overrightarrow{H_{i}^{x}}, \overleftarrow{H_{i}^{x}})$

(6)

$H^{x} = [H_{1}^{x}, H_{2}^{x}, \dots, H_{t}^{x}]$

(7)

where $\overrightarrow{H_{i}^{x}}$ and $\overleftarrow{H_{i}^{x}}$, respectively, represent the features of the Bi-LSTM on the forward and backward time axes at the i-th time step, and $H_{i}^{x}$ represents the extracted contextual spatiotemporal feature information. $H_{i}^{x}$ contains the contextual information of each month in different years, effectively capturing the periodic patterns of cotton growth. Then, we obtain the output of the encoder by inputting $H^{x}$ into the activation function and the linear encoder, as shown in Equation (8):

$enc^{x} = w_{e} \cdot LeakyReLU(H^{x}) + b_{e}$

(8)

where $w_{e}$ and $b_{e}$ represent the weights and biases of the encoder's linear layer, respectively. After obtaining the output of the encoder, we calculated the mean and standard deviation of the latent variables through two different linear layers, and then obtained the latent representation $z_{vae}^{x}$ based on the mean $\mu_{vae}^{x}$ and standard deviation $\sigma_{vae}^{x}$. The process is shown in Equations (9)–(11):

$\mu_{vae}^{x} = w_{\mu} \cdot enc^{x}$

(9)

$\sigma_{vae}^{x} = w_{\sigma} \cdot enc^{x}$

(10)

$z_{vae}^{x} = \mu_{vae}^{x} + \sigma_{vae}^{x} \times \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$

(11)

where $w_{\mu}$ and $w_{\sigma}$ represent the weights of two different linear layers, and $\epsilon$ is the noise sampled from the standard normal distribution. Finally, we calculated $KL(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z))$, which measures the difference between the variational posterior distribution of the encoder and the prior distribution. The calculation formula for the KL divergence is shown in Equation (12):

$KL(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)) = \frac{1}{2} \left( (\mu_{vae}^{x})^{2} + (\sigma_{vae}^{x})^{2} - \log((\sigma_{vae}^{x})^{2}) - 1 \right)$

(12)
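The sampling step in Equation (11) and the KL term in Equation (12) can be checked numerically with a small NumPy sketch. The function names, the toy values, and the choice to sum the KL term over the latent dimensions of each sample are assumptions made for illustration; the text does not state how the per-dimension terms are aggregated.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Equation (11): z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_per_sample(mu, sigma):
    """Equation (12), summed over the latent dimensions of each sample (assumption)."""
    var = sigma ** 2
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, 0.0]])     # toy batch: b = 2, latent size 2
sigma = np.array([[1.0, 1.0], [1.0, 1.0]])
z = reparameterize(mu, sigma, rng)
kl = kl_per_sample(mu, sigma)
```

In this toy batch the first sample matches the standard normal prior exactly, so its KL term is zero, while shifting the mean by one unit in a single dimension gives 0.5.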

Decoder. Its function is to reconstruct the input from the latent representation produced by the encoder and to impose end-to-end representation consistency constraints. During this process, the reconstruction task forces the model to learn the nonlinear relationships among the various features or parameters of the cotton yield data. The constructed decoder consists of a decoding Bi-LSTM, a LeakyReLU activation function, and a decoder output layer (fully connected layer), and its structure is shown in Figure 6 as ②. The process is as follows: First, we obtain the contextual spatiotemporal features $H^{z}$ through the decoding Bi-LSTM, in a manner similar to Equations (4)–(7). Then, we obtain the reconstruction output of the decoder by inputting $H^{z}$ into the activation function and the linear decoder, as shown in Equation (13):

$\hat{x} = w_{d} \cdot LeakyReLU(H^{z}) + b_{d}$

(13)

where $w_{d}$ and $b_{d}$ represent the weights and biases of the decoder's linear layer, respectively. After obtaining the reconstructed feature $\hat{x}$, we calculated the reconstruction error $L_{2}(x, \hat{x})$, as shown in Equation (14):

$L_{2}(x, \hat{x}) = \sum_{k = 1}^{b} (x^{k} - \hat{x}^{k})^{2}$

(14)

where b is the size of the mini-batch.

Regressor. Its purpose is to achieve end-to-end yield prediction. The structure of the regressor, shown in Figure 6 as ③, mainly consists of a LeakyReLU activation function and a fully connected layer. The prediction calculation process is shown in Equation (15):

$logits = w_{r} \cdot LeakyReLU(H^{x}) + b_{r}$

(15)

where $logits$ represents the predicted cotton yield. Finally, after obtaining the predicted yield, we calculate the mean square error between it and the actual yield y. The calculation formula is shown in Equation (16):

$MSE_{logits} = \frac{1}{b} \sum_{k = 1}^{b} (y^{k} - logits^{k})^{2}$

(16)

Model optimization. After calculating each of the model's loss terms, we perform model optimization using the gradient descent algorithm. The total loss $L_{total}$ is shown in Equation (17):

$L_{total} = MSE_{logits} + L_{2}(x, \hat{x}) + KL(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z))$

(17)

Finally, we trained the constructed VAE by minimizing the total loss.
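How the three terms of Equation (17) combine can be shown with a small numerical sketch. The function name `total_loss`, the toy arrays, and the decision to sum the KL term over all latent dimensions and samples are assumptions; the equal weighting of the three terms follows Equation (17) as written.

```python
import numpy as np

def total_loss(y, y_pred, x, x_hat, mu, sigma):
    """Return the total loss of Equation (17) and its three components."""
    mse = np.mean((y - y_pred) ** 2)    # Equation (16): prediction error
    recon = np.sum((x - x_hat) ** 2)    # Equation (14): reconstruction error
    # Equation (12), summed over latent dimensions and samples (assumption).
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)
    return mse + recon + kl, mse, recon, kl

# Toy mini-batch with b = 2 samples.
y = np.array([1.0, 2.0])
y_pred = np.array([1.0, 4.0])
x = np.array([[1.0, 2.0], [0.0, 1.0]])
x_hat = np.array([[1.0, 1.0], [0.0, 1.0]])
mu = np.zeros((2, 2))
sigma = np.ones((2, 2))
loss, mse, recon, kl = total_loss(y, y_pred, x, x_hat, mu, sigma)
```

Here the prediction term contributes 2.0, the reconstruction term 1.0, and the KL term 0.0 because the toy posterior equals the prior, so the total is 3.0.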

2.3.3. Experiment Setup

After describing the data and methods, we proceeded with the experimental design. Figure 7 illustrates the specific process of our method, as well as method evaluation, baseline comparison, etc. Below, we will introduce the specific data partitioning and baseline methods.

Dataset division settings.
To effectively verify the generalization ability of the model, we designed three different data partitioning methods. The first is the cross-year setting, which is applied to the 23-year data of Bahawalnagar. We use the data from 1999 to 2016 as the training set and the data from 2017 to 2021 as the test set. As shown in Figure 1, the distribution of the cotton data in Bahawalnagar fluctuates significantly over the 23 years. Therefore, this setting can effectively verify the model's generalization ability for future cotton data. The second is the cross-district setting, which is applied to the 81 districts in Turkey. We completely separate the districts of the training set from those of the test set so that the two sets of districts are disjoint. This setting helps the yield prediction model to be extended to unseen districts and, thus, has high research value. The third is the cross-year and cross-district setting, which is also applied to the data in Turkey. In this setting, we use the data from 2019 to 2021 as the training set and the data from 2022 to 2023 as the test set. This setting has high research value because it enables the model to predict future yields in unseen districts. Please note that in both settings that use the Turkey data, the districts in the training set and the test set are completely disjoint. Additionally, all evaluation results are reported on the test set, and the training set is used only for fitting the model. The statistical information of the dataset is shown in Table 2.
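The cross-district and the combined cross-year and cross-district partitions can be sketched as below. This is only an illustration under stated assumptions: the helper names, the row layout, the placeholder district names, and in particular the 20% test fraction and the random seed are hypothetical, since the text does not specify how districts were assigned to each set (Table 2 gives the actual statistics).

```python
import random

def split_districts(districts, test_fraction=0.2, seed=0):
    """Randomly assign whole districts to train or test so the two sets are disjoint."""
    unique = sorted(set(districts))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return set(unique[n_test:]), set(unique[:n_test])

def cross_year_cross_district(rows, train_districts, test_districts, last_train_year):
    """Keep early years of training districts and later years of test districts."""
    train = [r for r in rows
             if r["district"] in train_districts and r["year"] <= last_train_year]
    test = [r for r in rows
            if r["district"] in test_districts and r["year"] > last_train_year]
    return train, test

# Toy table: 81 placeholder districts, one row per district and year (2019-2023).
districts = [f"district_{k:02d}" for k in range(81)]
rows = [{"district": name, "year": year}
        for name in districts for year in range(2019, 2024)]
train_districts, test_districts = split_districts(districts)
train_rows, test_rows = cross_year_cross_district(
    rows, train_districts, test_districts, last_train_year=2021)
```

Splitting at the district level, rather than at the row level, is what guarantees that no test district contributes any record to training.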
Comparison benchmarks and assessment methods for cotton yield prediction.
To verify the effectiveness of the proposed method, we compared it with nine different baseline methods, including five machine learning algorithms, namely linear regression (LR), random forest regression (RFR), support vector regression (SVR), ridge regression (RR), and lasso regression (LaR), as well as four deep learning models: multi-layer perceptron (MLP), multi-scale convolutional neural network (Multi-scale CNN), bidirectional long short-term memory network (Bi-LSTM), and Transformer [24].
To evaluate and compare the performance of different methods, we calculated three important production prediction indicators, namely root mean square error (RMSE), mean absolute error (MAE), and the $R^{2}$ coefficient. The calculation methods for these three indicators are as follows:

$RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (y_{test}^{i} - logits^{i})^{2}}$

(18)

$MAE = \frac{1}{N} \sum_{i = 1}^{N} \left| y_{test}^{i} - logits^{i} \right|$

(19)

$R^{2} = 1 - \frac{\sum_{i = 1}^{N} (y_{test}^{i} - logits^{i})^{2}}{\sum_{i = 1}^{N} (y_{test}^{i} - \bar{y})^{2}}$

(20)

where $y_{test}^{i}$ represents the actual yield of the i-th sample in the test dataset, $\bar{y}$ represents the mean of the actual yields, $logits^{i}$ represents the corresponding predicted yield, and N is the number of test samples.
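The three metrics can be computed directly from their definitions, as in the following plain-Python sketch. The function names and the toy yield values are assumptions chosen so that the results are easy to verify by hand.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, Equation (18)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error, Equation (19)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination, Equation (20)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.0, 2.0, 3.0, 6.0]
bad_pred = [4.0, 3.0, 2.0, 1.0]

good_scores = (rmse(y_true, good_pred), mae(y_true, good_pred), r2(y_true, good_pred))
bad_r2 = r2(y_true, bad_pred)
```

With these toy values, the first prediction gives RMSE = 1.0, MAE = 0.5, and $R^{2}$ = 0.2, while the reversed prediction gives $R^{2}$ = −3. $R^{2}$ is negative whenever a model's squared error exceeds that of always predicting the mean of the test yields, which is how the negative values reported for some baselines should be read.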

3. Results and Discussion

3.1. Contrastive Analysis

This study systematically evaluated the performance of the proposed GD-VAE model and nine baseline methods on three challenging cotton yield prediction tasks. Figure 8 shows a comparative overview of the RMSE, MAE, and $R^{2}$ metrics that were obtained for the Bahawalnagar and Turkey study areas.

First, in the Bahawalnagar results (Figure 8a,d,g), which correspond to the single-district cross-year prediction task, GD-VAE significantly outperforms all baseline models, with an RMSE of 58.4 lbs/acre and an MAE of 38.19 lbs/acre. RFR, the next-best model (RMSE = 69.3 lbs/acre; MAE = 58.17 lbs/acre), has errors 18.6% and 31.3% higher than GD-VAE, which stems from the limitations of random forests in capturing long-term temporal patterns. In contrast, GD-VAE effectively extracts the nonlinear trends in historical data through its variational encoding mechanism, overcoming the shortcomings of traditional statistical models in non-stationary sequences [25]. Compared to deep learning models, such as Transformer (RMSE = 98.47 lbs/acre) and Bi-LSTM (RMSE = 97.24 lbs/acre), GD-VAE reduces RMSE by approximately 40%, indicating that its design effectively filters the observational noise caused by environmental factors. Traditional linear models (LR/RR/LaR) completely failed in this task ($R^{2} \leq -64.96$), confirming the complex nonlinear relationships between cotton yield and environmental factors. The $R^{2}$ value of 0.65 for GD-VAE confirmed its modeling capabilities, breaking through the dependence of early climate suitability models on linear assumptions [26].

Furthermore, in the cross-district transfer prediction task in Turkey (Figure 8b,e,h), GD-VAE continues to lead, with an RMSE of 38.59 kg/da and an MAE of 30.34 kg/da. Compared to the second-best model, RFR (RMSE = 51.22 kg/da), GD-VAE reduces the RMSE by 24.6%. This advantage stems from its district-invariant feature extraction module, which effectively aligns climate–soil distribution differences across districts [27]. Models like Transformer (RMSE = 55.70 kg/da) are limited in performance due to their reliance on large-scale training data. GD-VAE regularizes its latent space through variational inference, maintaining an $R^{2}$ of 0.49 even with limited samples. Its performance in this scenario (compared to, for example, the 93.15 kg/da RMSE of Multi-scale CNN) suggests that cross-district prediction can significantly reduce manual survey costs, requiring only historical data from some districts to support predictions for new districts.
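The latent-space regularization mentioned above can be sketched with the standard VAE formulation. This is a minimal illustration and not the authors' implementation: it assumes a diagonal Gaussian posterior and a standard normal prior, and the function names and toy values are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VAE regularizer."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]   # toy encoder outputs (assumed)
z = reparameterize(mu, log_var, rng)   # latent code that a regressor could consume
kl = kl_to_standard_normal(mu, log_var)
```

The KL term is zero only when the encoder outputs a standard normal, so minimizing it alongside the reconstruction and regression losses pulls the latent codes toward a compact region.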

Finally, in the most complex task, cross-district and cross-year prediction in Turkey (Figure 8c,f,i), GD-VAE maintained its advantage, with an RMSE of 46.46 kg/da and an MAE of 37.74 kg/da. Although the Transformer improved its performance in this task (RMSE = 50.48 kg/da), its RMSE was still approximately 8.6% higher than that of GD-VAE. The significant degradation of RFR (RMSE = 80.20 kg/da) and Bi-LSTM (RMSE = 279.39 kg/da) exposed the deficiencies of these models in handling long-term time series [25]. As the only model in this comparison with a positive $R^{2}$ (0.14), GD-VAE demonstrated that historical data from some districts can suffice to predict future yields in unseen districts, which supports the development of automated decision-making systems.

As shown in Figure 8, GD-VAE achieved the best results in all three tasks: the Bahawalnagar cross-year prediction, the Turkey cross-district prediction, and the Turkey cross-district and -year prediction.
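The relative differences quoted in this subsection can be reproduced from the reported RMSE values. The sketch below uses hypothetical helper names and makes the choice of reference value explicit, since a reduction relative to the baseline and an excess relative to GD-VAE are different quantities.

```python
def reduction_vs_baseline(baseline, ours):
    """Percent by which `ours` is lower than `baseline`, relative to the baseline."""
    return 100.0 * (baseline - ours) / baseline

def excess_vs_ours(baseline, ours):
    """Percent by which `baseline` is higher than `ours`, relative to ours."""
    return 100.0 * (baseline - ours) / ours

# RMSE values as reported in the text.
transformer_cross_year = reduction_vs_baseline(98.47, 58.4)    # Bahawalnagar, vs. Transformer
rfr_cross_district = reduction_vs_baseline(51.22, 38.59)       # Turkey cross-district, vs. RFR
transformer_dual = excess_vs_ours(50.48, 46.46)                # Turkey cross-district and -year
```

These evaluate to roughly 40.7%, 24.7%, and 8.7%, in line with the approximate figures given above.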

Table 3 shows the comparison results between our method and another generative network, GAN. GAN [28] is a generative adversarial network that can generate samples that conform to the distribution of real data, so it can also be regarded as a data augmentation measure.

As can be seen from Table 3, GD-VAE outperforms GAN on every metric. We attribute this to the use of a Gaussian distribution estimated from the samples themselves, so that the generated data follow the distribution of the original data. GAN, by contrast, relies on the quality of a trained neural network for data augmentation, and neural networks are themselves sensitive to the amount of data. This can be observed in the $R^{2}$ values of GAN on the Turkish data, which change from −0.20 in the cross-district setting to 0.09 in the cross-district and -year setting, where the GAN network shows some improvement as the amount of data increases.

The agreement between the predicted and actual cotton yields was further examined using scatter plots (Figure 9, Figure 10 and Figure 11), which provide an overview of this agreement for all methods.

In Figure 9, the GD-VAE yield predictions lie closer to the actual yields (points closer to the red diagonal) than those of the other methods, and Figure 10 and Figure 11 likewise show GD-VAE giving the most accurate yield predictions.

3.2. Feature Importance Analysis

This study revealed district differences in the key driving factors for cotton yield prediction through systematic feature ablation experiments. As shown in Figure 12, when specific features were removed, GD-VAE and the best baseline model, RFR, exhibited markedly different response patterns on the Turkish dataset. In the environmental factor analysis of the Turkish dataset (Figure 12c,d), GD-VAE was most sensitive to the absence of ssrd (solar radiation); removing this feature caused the MAE to increase from 37.74 kg/da to 41.1 kg/da (a change of 8.18% relative to the post-ablation value) and the RMSE to increase from 46.46 kg/da to 51.92 kg/da (10.52% on the same basis). This phenomenon aligns closely with the agroecological characteristics of Turkey’s primary cotton-producing district: the area has an annual average sunshine duration exceeding 2800 h (18% higher than the global average for cotton-growing districts) [29], making photosynthetically active radiation the primary limiting factor in yield formation. By comparison, RFR exhibits low sensitivity to the removal of ssrd (an MAE increase of only 5%), indicating that traditional machine learning methods fail to adequately capture the nonlinear response mechanisms of light energy utilization efficiency.

In the vegetation index analysis of the Bahawalnagar dataset (Figure 12a,b), removing SAVI (Soil-Adjusted Vegetation Index) in the single-index ablation experiment caused the $R^{2}$ of GD-VAE to drop from 0.65 to 0.55 (Table 4), revealing the special requirements for predictions in semi-arid districts; the coefficient of variation for soil background reflectance in this district reaches as high as 0.37 (notably higher than the 0.21 in the Turkish cotton district), necessitating that the model utilize the soil brightness correction functionality of SAVI to accurately interpret vegetation cover information [30]. This finding addresses a limitation of previous studies that focused solely on the NDVI as a universal vegetation indicator, and it points to the dominant role of soil–vegetation interactions in specific ecological districts. In Table 4, the $R^{2}$ decline of GD-VAE in the Turkish dataset (from 0.14 to 0.07) is smaller than that in Bahawalnagar (from 0.65 to 0.55), reflecting its stronger adaptability relative to RFR (whose $R^{2}$ declined from −1.40 to −1.50 after feature ablation). However, future research should further quantify feature interaction effects (e.g., the synergistic effect of ssrd and SAVI) and explore the contribution of frost stress factors (e.g., daily minimum temperature t2m_min) in high-latitude cotton districts [31] to refine the feature engineering theoretical framework across diverse environmental scenarios.
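A leave-one-feature-out ablation of the kind described in this section can be sketched as follows. This is an illustrative outline only: a closed-form ridge regressor stands in for GD-VAE and RFR, the data are synthetic, and the helper names are hypothetical. Only the feature labels ssrd, t2m_min, and savi are borrowed from the text.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an intercept (stand-in model, not GD-VAE)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def ablation_rmse(X_train, y_train, X_test, y_test, names):
    """Retrain once with all features ("All") and once per removed feature."""
    results = {}
    all_cols = list(range(X_train.shape[1]))
    runs = [("All", None)] + [(name, i) for i, name in enumerate(names)]
    for label, drop in runs:
        cols = [c for c in all_cols if c != drop]   # keep every column except the dropped one
        w, b = fit_ridge(X_train[:, cols], y_train)
        results[label] = rmse(y_test, X_test[:, cols] @ w + b)
    return results

# Synthetic data (assumed): the first feature dominates the target.
rng = np.random.default_rng(0)
names = ["ssrd", "t2m_min", "savi"]
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

scores = ablation_rmse(X[:150], y[:150], X[150:], y[150:], names)
```

The feature whose removal raises the test error the most relative to the "All" run is read as the most important one, which is the logic behind Figure 12 and Table 4.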

3.3. Data Augmentation Effectiveness Analysis

Our method includes a data augmentation strategy based on Gaussian distribution sampling. To verify its effectiveness, we conducted the ablation test detailed in this section. Table 5 shows the test set results of the VAE without Gaussian-distribution-based data augmentation (w/o GD) and of the full GD-VAE. Without data augmentation, the VAE deteriorates markedly on all three evaluation indicators (RMSE, MAE, and $R^{2}$). The $R^{2}$ even becomes negative, for example −1.58 in the Turkish cross-district and -year setting, indicating that the model essentially failed to learn a useful mapping without data augmentation (a negative $R^{2}$ means that the predictions are worse than a constant prediction of the mean test yield). This indicates that the proposed data augmentation is effective. Additionally, Figure 13 shows scatter plots of the original data and the data generated by Gaussian distribution sampling. The quality of the sampled data is closely tied to the quality of the original data. For example, in the Bahawalnagar data (Figure 13a), where the original data points (stars) lie very close to each other, the sampled data (circles) overlap. Sampled data that overlap with other data add noise to the model’s learning, which can cause the model to learn incorrect patterns and generalize poorly. In the cross-district setting of Turkey (Figure 13b), where the original data are more dispersed, the sampled data are more representative. Overall, the proposed data augmentation method based on Gaussian distribution sampling can effectively improve the performance of VAE.

Improving the quality of raw cotton yield data is the cornerstone of ensuring the accuracy of predictive models, especially for deep learning models such as GD-VAE, which are sensitive to data quality [32]. If the raw data contain noise (such as sensor errors, abnormal climate records, or human recording biases), direct augmentation or simple interpolation may amplify the biases. Recent work suggests that combining multi-source sensor fusion technology [33] with adaptive filtering algorithms [34] can significantly improve data quality. Additionally, the analysis of time-series hyperspectral remote sensing imagery [35] should integrate spectral, temporal, and spatial information, using object-oriented segmentation methods rather than traditional methods. This can partially eliminate noise in remote sensing and yield data, thereby improving yield estimation accuracy. A rigorous data quality control pipeline should include outlier detection, multi-source data cross-validation (e.g., comparing ground sensor data, remote sensing-derived yield estimates, and manual field measurements), and uncertainty quantification steps to provide reliable inputs for subsequent models.

4. Conclusions

Accurate forecasting of crop yields is important for all aspects of agriculture, from national farms to global agricultural economies. As one of the world’s important cash crops, cotton is widely used in many industries, such as textiles, food, and medicine. Cotton yield is affected by many factors, including climate change, changes in soil quality, and the continuous progress of agricultural technology. Accurately predicting cotton yield is not only important for agricultural production planning but can also help farmers reduce risks and improve returns, and it helps to keep the cotton market stable by avoiding overproduction or undersupply. At present, common cotton yield prediction methods include statistical models (regression analysis), machine learning algorithms (decision trees and support vector machines), and remote sensing techniques. Statistical models are simple and suited to stable data. Machine learning, on the other hand, can handle complex nonlinear relationships and improve prediction accuracy by training on large volumes of historical data. Because deep learning models are data-driven, their performance depends heavily on the diversity and representativeness of the training data. In this study, we propose data augmentation based on Gaussian distribution sampling to enlarge the training data of deep learning models and construct a VAE architecture to perform cotton yield prediction. In our experiments, the proposed method outperforms the traditional machine learning methods and deep learning models it was compared with. GD-VAE achieves an RMSE of 58.4 lbs/acre, an MAE of 38.19 lbs/acre, and an $R^{2}$ coefficient of 0.65 in the Bahawalnagar cross-year setting. In the cross-district setting on the Turkish data, GD-VAE obtains an RMSE of 38.59 kg/da, an MAE of 30.34 kg/da, and an $R^{2}$ coefficient of 0.49. In the dual cross-year and cross-district setting, GD-VAE obtains an RMSE of 46.46 kg/da, an MAE of 37.74 kg/da, and an $R^{2}$ coefficient of 0.14.

Limitations: We observed a significant decrease in $R^{2}$ for GD-VAE under the cross-district and cross-year setting in Turkey compared to the cross-district setting alone. This primarily stems from the combined effects of two factors: increased cross-district data heterogeneity and heightened temporal complexity across years. Specifically, cross-district prediction must simultaneously address significant differences in climate conditions, soil properties, and agricultural management practices across districts (e.g., variations in sunlight and precipitation patterns between the Aegean and Anatolian districts), while the cross-year extension further introduces the coupled effects of interannual climate variability (e.g., changes in the frequency of drought events) and long-term trends (e.g., the declining trend in annual average PM2.5 concentrations) [36]. This dual complexity in both temporal and spatial dimensions exacerbates the distribution shift in input data, making it difficult for the model to learn stable global feature mappings from the training set. In the future, dynamic feature decoupling mechanisms [37] and meta-learning strategies [11] can be employed to explicitly separate district-specific attributes from time-varying factors, thereby enhancing the robustness of modeling in complex spatiotemporal scenarios. Additionally, we found that the effectiveness of the proposed data augmentation depends on the quality of the original data. Therefore, where the original data are of poor quality, the proposed data augmentation may introduce more noise and degrade model learning. Furthermore, the VAE model performs poorly without data augmentation, primarily because it has more parameters than networks such as MLP, Bi-LSTM, and CNN. Therefore, future improvements could explore how to use the available data more effectively to optimize model parameters [38] rather than relying on data augmentation.
We conclude by inviting and urging other researchers to complement our efforts by providing innovative methodologies that can be used to reliably predict cotton yields in data-scarce areas.

Author Contributions

Conceptualization, methodology, validation, writing—review and editing, and resources, Y.L., X.W., and L.G.; formal analysis, Y.L. and X.W.; investigation, data curation, supervision, project administration, and funding acquisition, Y.L. and L.G.; writing—original draft preparation and visualization, Y.L. and X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Technology Development for Agricultural Condition Parameter Acquisition and Integrated Application of Sensing Equipment (2022LQ02004), the Science and Technology Development Program of the Pilot Zone for Innovation Driven Development along the Silk Road Economic Belt and the Wu-Chang-Shi National Innovation Demonstration Zone (2023LQJ03), and the ‘Tianshan Talent’ Training Program: Research on Carbon Sink Function of Cotton and Carbon Label Creation of Cotton Textile under the Background of Carbon Peak and Carbon Neutrality (2023TSYCCX0020).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available for download at the following web links: https://github.com/11124asda/GD-VAE (accessed on 1 August 2025).

Acknowledgments

This work was supported by three projects led by Lei Gao, namely Key Technology Development for Agricultural Condition Parameter Acquisition and Integrated Application of Sensing Equipment (2022LQ02004), Science and Technology Development Program of the Pilot Zone for Innovation Driven Development along the Silk Road Economic Belt and the Wu-Chang-Shi National Innovation Demonstration Zone (2023LQJ03), and ‘Tianshan Talent’ Training Program: Research on Carbon Sink Function of Cotton and Carbon Label Creation of Cotton Textile under the Background of Carbon Peak and Carbon Neutrality (2023TSYCCX0020), which are highly appreciated.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Khan, H.; Khan, N.; Khan, Z.; Han, Y.; Yang, B.; Lei, Y.; Zhi, X.; Xiong, S.; Shang, S.; Ma, Y.; et al. Water and heat resource utilization influence cotton yield through sowing date optimization under varied climate. Agric. Water Manag. 2025, 313, 109491.
2. Chen, X.; Qi, Z.; Gui, D.; Gu, Z.; Ma, L.; Zeng, F.; Li, L. Simulating impacts of climate change on cotton yield and water requirement using RZWQM2. Agric. Water Manag. 2019, 222, 231–241.
3. Liu, S.; Zhang, W.; Shi, T.; Li, T.; Li, H.; Zhou, G.; Wang, Z.; Ma, X. Increasing exposure of cotton growing areas to compound drought and heat events in a warming climate. Agric. Water Manag. 2025, 308, 109307.
4. Subramanian, K.; Sarkar, M.K.; Wang, H.; Qin, Z.H.; Chopra, S.S.; Jin, M.; Kumar, V.; Chen, C.; Tsang, C.W.; Lin, C.S.K. An overview of cotton and polyester, and their blended waste textile valorisation to value-added products: A circular economy approach–research trends, opportunities and challenges. Crit. Rev. Environ. Sci. Technol. 2022, 52, 3921–3942.
5. Zhang, Z.; Huang, J.; Yao, Y.; Peters, G.; Macdonald, B.; La Rosa, A.D.; Wang, Z.; Scherer, L. Environmental impacts of cotton and opportunities for improvement. Nat. Rev. Earth Environ. 2023, 4, 703–715.
6. Ahmad, S.; Ahmad, I.; Ahmad, B.; Ahmad, A.; Wajid, A.; Khaliq, T.; Abbas, G.; Wilkerson, C.J.; Hoogenboom, G. Regional integrated assessment of climate change impact on cotton production in a semi-arid environment. Clim. Res. 2023, 89, 113–132.
7. Rajput, P.K. Machine learning approach for Forest Biomass Modelling with In-Situ and Remote Sensing Data in Narmadapuram central India. Model. Earth Syst. Environ. 2025, 11, 350.
8. Chen, W.; Xiang, X.; Liu, S.; Guo, J.; Li, T.; Zhou, X.; Peng, D.; Deng, Z.; Wang, B.; Wang, H.; et al. An integrated exergy efficiency and machine learning method for optimizing organic solid waste gasification process. Eng. Appl. Artif. Intell. 2025, 159, 111805.
9. Ogbonna, C.; Ohabuka, C.; Bartholomew, D.; Anyiam, K.; Adamu, I. Optimizing Nigerian Bank Lending Systems: The Power of Discrete Wavelet Transform (DWT) in Denoising and Regression Analysis. Ann. Data Sci. 2025, 1–37.
10. Menghani, G. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Comput. Surv. 2023, 55, 1–37.
11. Yang, S.; Du, Y.; Zheng, X.; Li, X.; Chen, X.; Li, Y.; Xie, C. Few-shot intent detection with self-supervised pretraining and prototype-aware attention. Pattern Recognit. 2024, 155, 110641.
12. Archana, R.; Jeevaraj, P.E. Deep learning models for digital image processing: A review. Artif. Intell. Rev. 2024, 57, 11.
13. Wang, H.; Wang, H. Research on Microseismic Magnitude Prediction Method Based on Improved Residual Network and Transfer Learning. Appl. Sci. 2025, 15, 8246.
14. Ma, T.; Yu, J.; Wang, B.; Gao, M.; Yang, Z.; Li, Y.; Fan, M. A Power Monitor System Cybersecurity Alarm-Tracing Method Based on Knowledge Graph and GCNN. Appl. Sci. 2025, 15, 8188.
15. Deng, G.; Zhou, F.; Dong, H.; Xu, Z.; Li, Y. Accurate Sugarcane Detection and Row Fitting Using SugarRow-YOLO and Clustering-Based Spline Methods for Autonomous Agricultural Operations. Appl. Sci. 2025, 15, 7789.
16. Zhang, Y.; Zhang, L.; Yu, H.; Guo, Z.; Zhang, R.; Zhou, X. Research on the Strawberry Recognition Algorithm Based on Deep Learning. Appl. Sci. 2023, 13, 11298.
17. Wang, H.; Dai, Y.; Yao, Q.; Ma, L.; Zhang, Z.; Lv, X. Multi-task learning model driven by climate and remote sensing data collaboration for mid-season cotton yield prediction. Field Crops Res. 2025, 333, 110070.
18. Yu, S.H.; Kang, Y.; Lee, C.G. Comparison of the Spray Effects of Air Induction Nozzles and Flat Fan Nozzles Installed on Agricultural Drones. Appl. Sci. 2023, 13, 11552.
19. Li, N.; Li, Y.; Yang, Q.; Biswas, A.; Dong, H. Simulating climate change impacts on cotton using AquaCrop model in China. Agric. Syst. 2024, 216, 103897.
20. Shin, H.J.; Kim, S.; Kang, H.; Lee, A.G. Novel Instrument for Clinical Evaluations of Active Extraocular Muscle Tension. Appl. Sci. 2023, 13, 11431.
21. Istipliler, D.; Ekizoğlu, M.; Çakaloğulları, U.; Tatar, Ö. The impact of environmental variability on cotton fiber quality: A comparative analysis of primary cotton-producing regions in Türkiye. Agronomy 2024, 14, 1276.
22. Alawneh, L.; Alsarhan, T.; Al-Zinati, M.; Al-Ayyoub, M.; Jararweh, Y.; Lu, H. Enhancing human activity recognition using deep learning and time series augmented data. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 10565–10580.
23. Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, Punta Cana, Dominican Republic, 7–11 November 2021.
24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
25. Xu, W.; Chen, P.; Zhan, Y.; Chen, S.; Zhang, L.; Lan, Y. Cotton yield estimation model based on machine learning using time series UAV remote sensing data. Int. J. Appl. Earth Obs. Geoinf. 2021, 104, 102511.
26. Knutti, R.; Rugenstein, M.A. Feedbacks, climate sensitivity and the limits of linear models. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2015, 373, 20150146.
27. Pabuayon, I.L.B.; Kelly, B.R.; Mitchell-McCallister, D.; Coldren, C.L.; Ritchie, G.L. Cotton boll distribution: A review. Agron. J. 2021, 113, 956–970.
28. Krichen, M. Generative Adversarial Networks. In Proceedings of the 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 6–8 July 2023; pp. 1–7.
29. Grundy, P.R.; Yeates, S.J.; Bell, K.L. Cotton production during the tropical monsoon season. I—The influence of variable radiation on boll loss, compensation and yield. Field Crops Res. 2020, 254, 107790.
30. Huete, A. A soil-adjusted vegetation index (SAVI). Remote Sens. Environ. 1988, 25, 295–309.
31. Khan, M.A.; Anwar, S.; Abbas, M.; Aneeq, M.; de Jong, F.; Ayaz, M.; Wei, Y.; Zhang, R. Impacts of climate change on cotton production and advancements in genomic approaches for stress resilience enhancement. J. Cotton Res. 2025, 8, 17.
32. Xu, W.; Yang, W.; Chen, P.; Zhan, Y.; Zhang, L.; Lan, Y. Cotton Fiber Quality Estimation Based on Machine Learning Using Time Series UAV Remote Sensing Data. Remote Sens. 2023, 15, 586.
33. Liu, Q.; Wang, C.; Jiang, J.; Wu, J.; Wang, X.; Cao, Q.; Tian, Y.; Zhu, Y.; Cao, W.; Liu, X. Multi-source data fusion improved the potential of proximal fluorescence sensors in predicting nitrogen nutrition status across winter wheat growth stages. Comput. Electron. Agric. 2024, 219, 108786.
34. Yu, T.; Wang, B.; Li, X.; Yu, Y. A Tensor Decomposition-Based Censored Regression Adaptive Filtering Algorithm. Circuits Syst. Signal Process. 2025, 44, 6151–6166.
35. Chang, W.; Yang, S.; Xi, X.; Wang, H.; Liu, Z.; Zhang, X.; Li, S.; Zhao, Y. Classification of seed maize using deep learning and transfer learning based on times series spectral feature reconstruction of remote sensing. Comput. Electron. Agric. 2025, 237, 110738.
36. Kamangir, H.; Hajiesmaeeli, M.; Earles, J.M. California Crop Yield Benchmark: Combining Satellite Image, Climate, Evapotranspiration, and Soil Data Layers for County-Level Yield Forecasting of Over 70 Crops. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 11–15 June 2025; pp. 5491–5500.
37. Lu, W.; Chen, S.B.; Shu, Q.L.; Tang, J.; Luo, B. Decouplenet: A lightweight backbone network with efficient feature decoupling for remote sensing visual tasks. IEEE Trans. Geosci. Remote Sens. 2024, 62.
38. Cai, Z.; An, X.; Xie, D.; Xue, Y.; Liu, X.; Wang, Q.; Chen, L.; Liu, L.; Zhang, C.; Xue, C. An attitude control method with model-aided estimation and parameter-adaptive optimization for high clearance sprayers. Comput. Electron. Agric. 2025, 237, 110572.

Figure 1. The distribution of the mean NDVI values of cotton data in the Bahawalnagar district from 1999 to 2021, containing data for five months per year. The abscissa represents all the years of the data. It can be observed that there is a large distribution difference in relation to the data after 2016.

Figure 2. The distribution of characteristics of Turkish cotton data. The abscissa represents the value of each indicator, and the ordinate represents the number of each indicator.

Figure 3. The spatial distribution of different indicators in different districts or years. The abscissa represents different districts, and the ordinate represents different years. It contains data for five months of each year, with the white areas indicating none for that year. The figure mainly shows the district changes of different indicators in different years. From this, outliers in the same indicators in different districts can be observed, as shown by the highlighted values in the figure.

Figure 4. The correlations of several important characteristics. The abscissa and ordinate represent different indicators, and the value of each indicator is the average of all data. It can be observed that the correlations among most of the features are relatively low, indicating that there are more complex relationships between them.

Figure 5. The linear correlation between the characteristics and the yield. The abscissa represents the value of the indicator, and the ordinate represents the value of the yield. From this, it can be observed that there is almost no linear relationship between each feature and the output.

Figure 6. The linear correlation between the characteristics and the yield. From this, it can be observed that there is almost no linear relationship between each feature and the output.

Figure 7. Experimental design process. This includes the proposed method execution process, baseline methods for comparison, and evaluation design and metrics.

Figure 8. The comparison results of RMSE, MAE, and $R^{2}$. The abscissa represents the different models, and the ordinate represents the evaluation metric scores. Cross-district and Cross-district and -year indicate results for Turkey. The first row (a–c) presents the RMSE results for the three settings. The second row (d–f) presents the MAE results for the three settings. The third row (g–i) presents the $R^{2}$ results for the three settings. In every panel, the rightmost entry represents the results of our proposed GD-VAE model. The red dotted line indicates the best outcome.

Figure 9. The linear relationship between the predicted yield and the actual yield under the cross-year setting of Bahawalnagar. The abscissa represents the production predicted by the model, and the ordinate represents the actual production. The number of all samples tested across years in Bahawalnagar is 5. (a–j) correspond to LR, RFR, SVR, RR, La, MLP, Multi-scale CNN, Bi-LSTM, Transformer, and GD-VAE (our model). As all the scattered points become closer to the red diagonal dotted line, the linear relationship becomes better. The last two images are the same.

Figure 10. The linear relationship between the predicted yield and the actual yield under the cross-district setting of Turkey. The number of all samples for the cross-district test in Turkey is 72. (a–j) correspond to LR, RFR, SVR, RR, La, MLP, Multi-scale CNN, Bi-LSTM, Transformer, and GD-VAE (our model).

Figure 11. The linear relationship between the predicted yield and the actual yield under both cross-district and -year settings of Turkey. The number of all samples tested cross-district and -year in Turkey is 37. (a–j) correspond to LR, RFR, SVR, RR, La, MLP, Multi-scale CNN, Bi-LSTM, Transformer, and GD-VAE (our model).

Figure 12. Feature importance analysis. The abscissa represents the removed indicators, and the ordinate represents the evaluation results. ‘All’ means no indicators have been removed. Here, (a,b) represent the importance of the indicators in the Bahawalnagar dataset, while (c,d) represent the Turkey dataset.

Figure 13. Distributions of data augmentation sampling. The abscissa and ordinate represent the values of the 0 and 1 indices of the data after dimensionality reduction. The stars represent the original data, the circles represent the sampled data, and the different colors represent the data of different years. For each sample, 100 new samples were generated through data augmentation. (a) represents the sampling distributions of Bahawalnagar data, (b) represents the cross-district sampling distributions of Turkey data, and (c) represents both the cross-district and -year sampling distribution of Turkish data.

Table 1. District sources of cotton production data for Pakistan and Turkey.

Country	Districts
Pakistan	Bahawalpur
Turkey	Ceyhan, Karatas, Yuregir, Incirliova, Germencik, Kocarli, Nazilli, Soke, Yenipazar, Bismil, Cinar, Sur, Yenisehir, Antakya, Kirikhan, Kumlu, Reyhanli, Bergama, Kinik, Akhisar, Saruhanli, Derik, Kiziltepe, Akcakale, Bozova, Ceylanpinar, Harran, Karakopru, Suruc, Menderes, Dikili, Turgutlu, Altinozu, Yumurtalik, Sultanhisar, Cine, Efeler, Kosk, Tarsus, Foca, Menemen, Tire, Torbali, Sehzadeler, Golmarmara, Yunusemre, Eyyubiye, Hilvan, Siverek, Viransehir, Bayindir, Ahmetli, Seyhan, Kayapinar, Selcuk, Didim, Kuyucak, Salihli, Kirkagac, Haliliye, Aliaga, Baglar, Buharkent, Osmaniye, Hassa, Kadirli, Ergani, Artuklu, Imamoglu, Soma, Kozan, Saricam, Cermik, Cigli, Odemis, Savur, Bozdogan, Mazidagi, Silvan, Egil, Alasehir

Table 2. Data statistics information. Here, “d” represents the data dimension, which refers to the characteristics, indices, or indicators of the cotton field records.

Field	Setting	Characteristic, Index or Indicator	Train Year	Test Year	Train District	Test District	Total Year
Bahawalnagar	Cross-year	45d	18 (1999–2016)	5 (2017–2021)	1	1	23 (1999–2021)
Turkey (81 districts)	Cross-district	20d	5 (2019–2023)	5 (2019–2023)	50	31	5 (2019–2023)
Turkey (81 districts)	Cross-district and -year	20d	3 (2019–2021)	2 (2022–2023)	50	31	5 (2019–2023)

Table 3. Comparison results with GAN network.

Method	Cross-Year			Cross-District			Cross-District and -Year
Method	RMSE	MAE	$R^{2}$	RMSE	MAE	$R^{2}$	RMSE	MAE	$R^{2}$
GAN	98.94	72.83	−0.01	57.13	45.83	−0.20	47.84	40.92	0.09
GD-VAE	58.40	38.19	0.65	38.59	30.34	0.49	46.46	37.74	0.14

Table 4. The $R^{2}$ results of the feature importance analysis. “All” indicates that all the indicators are used; otherwise, the value is the average $R^{2}$ over the runs in which one indicator is excluded.

Method	Bahawalnagar (All)	Bahawalnagar (one indicator excluded)	Turkey (All)	Turkey (one indicator excluded)
RFR	0.45	0.43	−1.40	−1.50
GD-VAE	0.65	0.55	0.14	0.07
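
The averaging described in the Table 4 caption can be sketched as follows. This is an assumption-laden illustration: `evaluate_r2` is a hypothetical callback standing in for a full train-and-test cycle, and the toy scoring function and its numbers are invented, not the article's models or results.

```python
from statistics import mean

def leave_one_out_mean_r2(evaluate_r2, n_indicators):
    """Average R^2 over runs that each exclude exactly one indicator.

    evaluate_r2 is a hypothetical callback: given the indices of the
    indicators to keep, it trains and tests a model and returns R^2.
    """
    all_indices = list(range(n_indicators))
    scores = []
    for dropped in all_indices:
        kept = [i for i in all_indices if i != dropped]
        scores.append(evaluate_r2(kept))
    return mean(scores)

# Toy stand-in for a real model: each indicator adds a fixed amount to R^2.
contribution = [0.4, 0.2, 0.1, 0.1]

def toy_evaluate_r2(kept):
    return sum(contribution[i] for i in kept)

r2_all = toy_evaluate_r2(list(range(4)))                # all indicators
r2_excluded = leave_one_out_mean_r2(toy_evaluate_r2, 4) # mean of 0.4, 0.6, 0.7, 0.7
print(round(r2_all, 3), round(r2_excluded, 3))          # 0.8 0.6
```

A larger gap between the "All" score and the excluded-indicator average suggests that the model depends more heavily on individual indicators.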

Table 5. Results of ablation analysis.

Method	Cross-Year			Cross-District			Cross-District and -Year
Method	RMSE	MAE	$R^{2}$	RMSE	MAE	$R^{2}$	RMSE	MAE	$R^{2}$
w/o GD	116.05	95.79	−0.39	70.04	53.66	−0.66	80.69	63.67	−1.58
GD-VAE	58.40	38.19	0.65	38.59	30.34	0.49	46.46	37.74	0.14
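
The "w/o GD" row removes the Gaussian distribution data augmentation step. As a rough sketch of that step, following the abstract's description (estimate the mean and covariance of the existing data, then sample new rows), the code below uses NumPy. The function name `gaussian_augment`, the number of synthetic rows, the joint sampling of all columns, and the toy data are assumptions, not details confirmed by the article.

```python
import numpy as np

def gaussian_augment(data, n_new, seed=0):
    """Append n_new rows sampled from a Gaussian fitted to data.

    data: array of shape (n_samples, n_columns). Sampling all columns
    jointly is an assumption made for this sketch.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)  # columns are treated as variables
    rng = np.random.default_rng(seed)
    synthetic = rng.multivariate_normal(mean, cov, size=n_new)
    return np.vstack([data, synthetic])

# Toy data: 18 rows (one per training year) and 3 columns, randomly
# generated for illustration; these are not values from the study.
toy_rng = np.random.default_rng(42)
original = toy_rng.normal(loc=[25.0, 300.0, 700.0],
                          scale=[2.0, 40.0, 80.0],
                          size=(18, 3))

augmented = gaussian_augment(original, n_new=100)
print(original.shape, augmented.shape)  # (18, 3) (118, 3)
```

With few samples relative to the number of columns, the sample covariance is rank-deficient, so a practical implementation may need regularization; this sketch keeps the column count small to sidestep that.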

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

