1. Introduction
With the growth of the global population, the shortage of freshwater resources is becoming increasingly severe. Meanwhile, agriculture accounts for 70% of total water usage worldwide; of that, approximately 60% is lost through evaporation. The prediction of reference evapotranspiration (ET0) is crucial for calculating agricultural irrigation requirements, as well as for the design, optimization, and management of irrigation systems and water resources [1,2,3,4]. The widely recognized standard for calculating ET0 is the FAO-56 Penman–Monteith equation [5], proposed by the Food and Agriculture Organization (FAO), which has been extensively used for calculating ET0 values [6,7,8,9]. However, the FAO-56 Penman–Monteith equation requires numerous meteorological variables, many of which are often incomplete or missing, particularly in developing countries. This limitation restricts the direct application of FAO-56 [10,11]. As a result, researchers have developed and calibrated various simplified models and empirical formulas for estimating evapotranspiration, such as the Makkink, Priestley–Taylor, and Hargreaves models. However, evaluations in practical applications have shown that these models often exhibit significant deviations in prediction accuracy [12,13,14], and many coefficients require region-specific adjustments [15]. Therefore, there is a need to identify methods that can more accurately predict ET0 under conditions with limited meteorological data.
In recent years, with the rapid development of machine learning, it has been widely applied to predicting evapotranspiration. Many common machine learning algorithms, such as Artificial Neural Networks (ANN) [16], Random Forest (RF) [6,17], Support Vector Machines (SVM) [17,18,19], Adaptive Neuro-Fuzzy Inference Systems (ANFIS) [16,18], Multivariate Adaptive Regression Splines (MARS) [18], and M5P [17], have been used to model and predict ET0, yielding promising results. Among these, the Extreme Learning Machine (ELM) model has garnered significant attention in academia and industry due to its fast and efficient computation, excellent generalization ability, and strong adaptability [20]. Abdullah et al. (2015) were the first to apply the ELM model to predict evapotranspiration. They used the ELM model and the Feedforward Backpropagation Neural Network (FFBP) to predict ET0 at three sites across Iraq. Their study found that the ELM model exhibited excellent predictive capability for daily ET0 and outperformed the FFBP model in both prediction accuracy and efficiency, even when incomplete meteorological data were used. They strongly recommended the ELM model for evapotranspiration prediction [21].
Since then, many researchers have attempted to use the ELM model to predict evapotranspiration in different regions and under various input conditions, all finding that it delivers superior performance in ET0 prediction. For example, Kumar et al. (2016) used ELM, ANN, Genetic Programming (GP), and SVM to model and predict ET0 for Pusa in Bihar, India. They found that ELM achieved significantly higher prediction accuracy than the other three models with much shorter computation times [22]. Similarly, Fan et al. (2018) used SVM, ELM, and four tree-based ensemble models to model and predict daily evapotranspiration for different climatic regions in China. Their results showed that SVM and ELM had higher prediction accuracy and stability than the other models, with the ELM model slightly outperforming the SVM model [17].
In machine learning, a model’s performance is closely related to the setting of its parameters. Optimization algorithms can continuously refine the parameters by searching for the optimal solution, significantly improving the predictive accuracy of machine learning models. Therefore, optimization algorithms are widely used in conjunction with machine learning models, and their advantages have been repeatedly validated in practice [23,24,25]. Unlike most other machine learning models, the ELM model does not require iterative training: the connection weights between the input layer and the hidden layer, as well as the thresholds of the hidden layer, are randomly generated, which gives ELM an extremely fast computational speed. However, this also means that ELM may not consistently achieve optimal performance, leaving room for further optimization. As a result, many optimization algorithms have been applied to enhance ELM. In the field of evapotranspiration prediction, Particle Swarm Optimization (PSO) has repeatedly been shown to be one of the most effective methods for optimizing ELM, making PSO-ELM a popular and widely used model in evapotranspiration prediction.
For example, Zhu et al. (2020) optimized the ELM model using PSO for evapotranspiration modeling and prediction in northwestern China’s temperate continental climate region. They compared the prediction accuracy of the PSO-ELM model with the original ELM, ANN, RF, and six empirical models. The results indicated that PSO enhanced the performance of ELM, which outperformed the other machine learning and empirical models [12]. Similarly, Wu et al. (2021) used Genetic Algorithm (GA), PSO, and Artificial Bee Colony (ABC) algorithms to optimize ELM for ET0 prediction at 12 representative stations across five different climatic regions in China. Their results showed that all optimization algorithms significantly improved the prediction accuracy of ELM, with the PSO-ELM model performing best [13]. Additionally, Shi et al. (2023) optimized the ELM model using GA, PSO, and the Salp Swarm Algorithm (SSA) to predict ET0 at 23 stations in China. Their findings demonstrated that the PSO-ELM model had the highest prediction accuracy and the best applicability [26]. These studies highlight the significant performance improvements achieved by using PSO to optimize ELM for ET0 prediction, establishing PSO-ELM as a practical and superior approach.
However, the traditional PSO algorithm has some drawbacks, such as a tendency to get stuck in local optima on complex multimodal problems, a lack of randomness in particle position updates, and a rapid decrease in population diversity as iterations progress, which negatively impacts its global search capability. To address these limitations, the Quantum Particle Swarm Optimization (QPSO) algorithm, which incorporates quantum theory, was proposed to enhance population diversity, improve global search ability, and balance global and local searches, thus reducing the likelihood of getting trapped in local optima [27]. QPSO has been applied in many fields with improved performance [28,29,30]. In recent years, the improved Kernel Extreme Learning Machine (KELM) model has been proposed. It introduces a kernel function into the ELM model, which maps the data to a higher-dimensional space, thereby enhancing the ELM model’s ability to handle nonlinear data. Since the factors involved in ET0 prediction are complexly interrelated and highly nonlinear, this approach has also been applied to ET0 prediction and has yielded promising results [31,32]. However, using a single kernel function limits the model’s adaptability to complex and diverse data types. To overcome this, multiple kernel functions can be introduced; by optimizing the weightings of the different kernels, they further enhance the model’s ability and adaptability to handle data with different features [33,34,35,36].
In the KELM model, computing the kernel matrix becomes very time-consuming when the sample size is large. By applying clustering algorithms to group related samples and building a model for each group, the computational load of the kernel matrix can be significantly reduced, resulting in lower computational costs [32]. Therefore, the main objectives of this study are as follows: (1) To propose an improved QPSO-MKELM model based on the PSO-ELM model, and then introduce the Kmeans-QPSO-MKELM model, comparing its prediction accuracy and computation time for different numbers of clusters to determine the optimal number of clusters. (2) To conduct ablation experiments on the proposed Kmeans-QPSO-MKELM model. (3) To compare the prediction accuracy and computation time of Kmeans-QPSO-MKELM, QPSO-MKELM, PSO-ELM, RF, Whale Optimization Algorithm–Support Vector Regression (WOA-SVR), and ANFIS.
2. Materials and Methods
2.1. Study Area and Data Information
The study area is located in Yancheng, China, within the Huaihe River Basin, at a latitude of 32.75° N and a longitude of 120.25° E, with an elevation of 3 m. Yancheng is home to the Dafeng Farm, one of China’s significant agricultural regions. The dataset includes temperature, humidity, wind speed, sunshine duration, and precipitation. It was obtained from the China Meteorological Data Service Center and has undergone strict manual review and rigorous quality control, ensuring high reliability with no prolonged missing periods. No extreme outliers were detected upon inspection, and the missing-data rate was 0.26%. Missing values were imputed using linear interpolation. The dataset consists of daily meteorological data from 1980 to 2023, spanning 44 years. Data from 1980 to 2018 (39 years) were used for training the models, while data from 2019 to 2023 (5 years) were used for testing them.
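As a minimal sketch of this preprocessing, the gap filling and chronological split could look as follows (the original experiments were run in MATLAB; this Python/pandas code is purely illustrative, and the file and column layout are assumptions):

```python
import pandas as pd

# Load daily records (illustrative file name; assumes a 'date' column).
df = pd.read_csv("yancheng_daily_1980_2023.csv", index_col="date", parse_dates=True)

# Impute the sparse gaps (0.26% of values) by linear interpolation in time.
df = df.interpolate(method="time")

# Chronological split: 1980-2018 for training, 2019-2023 for testing.
train = df.loc["1980":"2018"]
test = df.loc["2019":"2023"]
```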
2.2. FAO-56 Penman–Monteith Model
The FAO-56 model, proposed by Allen et al. (1998), is a formula-based model for calculating evapotranspiration [5] and is known for its high computational accuracy. Since no experimental ET0 data are available in the study area, FAO-56 was used to calculate ET0 as the reference target output for the machine learning models. This approach is acceptable and commonly employed under such circumstances [37]. The specific formula is as follows:

$$ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T_{mean} + 273}\,U_2\,(e_s - e_a)}{\Delta + \gamma\,(1 + 0.34\,U_2)}$$

where ET0 (mm/d) is the reference evapotranspiration, Rn (MJ/m2/d) is the net radiation, G (MJ/m2/d) is the soil heat flux density, Tmean (°C) is the mean air temperature, ea (kPa) is the actual vapor pressure, es (kPa) is the saturated vapor pressure, Δ (kPa/°C) is the slope of the vapor pressure curve, γ (kPa/°C) is the psychrometric constant, and U2 (m/s) is the wind speed at 2 m above the ground. Allen et al. (1998) provide the detailed calculation process of this model [5]. Additionally, radiation is calculated from sunshine duration using the Angström equation [38].
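For illustration, the equation above transcribes directly into code (a minimal sketch; all inputs are assumed to be precomputed per FAO-56, and the function name is illustrative):

```python
def fao56_penman_monteith(Rn, G, T_mean, U2, e_s, e_a, delta, gamma):
    """Reference evapotranspiration ET0 (mm/d) from the FAO-56 PM equation.

    Rn, G in MJ/m2/d; T_mean in deg C; U2 in m/s; e_s, e_a in kPa;
    delta, gamma in kPa per deg C.
    """
    num = 0.408 * delta * (Rn - G) + gamma * (900.0 / (T_mean + 273.0)) * U2 * (e_s - e_a)
    den = delta + gamma * (1.0 + 0.34 * U2)
    return num / den
```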
2.3. Correlation Analysis and Variance
Reference evapotranspiration (ET0) has a complex nonlinear relationship with various meteorological factors. The FAO-56 model, while highly accurate, requires many meteorological inputs; these inputs are often incomplete, which limits the model’s applicability. In contrast, machine learning can build models and make predictions even when data are incomplete. It is important to note that different meteorological data contain varying amounts of information about evapotranspiration. Thus, correlation and variance analyses can be performed to preliminarily select features, providing a valuable reference for constructing the final model.
Correlation analysis is used to study the relationships between variables or between variables and outputs. This study employs the maximal information coefficient (MIC) [39], a correlation measure from information theory. MIC can capture both linear relationships between features and evapotranspiration and complex nonlinear relationships among them.
Variance is a statistical measure that describes the degree of data dispersion and quantifies how much data points deviate from the mean. The variance of a feature reflects its numerical variation range. However, variances cannot be directly used to measure information content due to differences in units and scales among features. By computing the variance of normalized features, the variation of different features can be compared on the same scale, enabling the quantification of the information contained in each feature; the larger the normalized variance, the more information the feature carries.
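A sketch of this screening step is given below. It assumes the third-party minepy package for MIC and scikit-learn for min-max scaling; both choices are assumptions, since the paper does not name its implementation:

```python
import numpy as np
from minepy import MINE
from sklearn.preprocessing import MinMaxScaler

def screen_features(X, y, names):
    """Rank features by MIC with the target and by normalized variance."""
    mine = MINE(alpha=0.6, c=15)                 # default MINE parameters
    X_scaled = MinMaxScaler().fit_transform(X)   # put features on one scale
    for j, name in enumerate(names):
        mine.compute_score(X[:, j], y)
        mic = mine.mic()                         # (non)linear association with ET0
        var = X_scaled[:, j].var()               # proxy for information content
        print(f"{name}: MIC={mic:.3f}, normalized variance={var:.3f}")
```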
2.4. New Model
2.4.1. K-Means Clustering
K-means is a widely used and simple clustering algorithm. It divides the dataset into k clusters, aiming to minimize the distance between data points within each cluster so that data points in the same cluster are more similar to each other than to points in different clusters.
Its principle and steps are as follows: first, the number of clusters k is set. Then, k data points are randomly selected as centroids. The data points are assigned to the nearest centroid’s cluster by calculating the Euclidean distance. The mean of the data points within each cluster is then recalculated to serve as the new centroid. The process of reassigning points to clusters and recalculating centroids is repeated until the change in centroids is below a set threshold, at which point the iteration stops. Once the iteration ends, all data points are assigned to one of the k clusters, completing the clustering process.
The K-means clustering algorithm operates in the original feature space, and its calculation is straightforward and efficient. It is suitable for large datasets and can perform clustering with relatively low computational cost. However, it also has some drawbacks. The value of k must be set in advance, and in practical applications, this value can be challenging to determine. Accurately setting k requires a good understanding of the dataset. In real-world applications, the value of k is usually determined by trying different values and comparing the clustering results.
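As a minimal sketch of this step (using scikit-learn, which is an assumption; the paper does not specify its K-means implementation), several candidate values of k can be fitted and compared, as described above:

```python
from sklearn.cluster import KMeans

def cluster_training_set(X_train, k_values=(10, 20, 30, 40)):
    """Fit K-means for each candidate k; return labels and centroids per k."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
        results[k] = (km.labels_, km.cluster_centers_)
    return results
```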
2.4.2. MKELM
The ELM [40] is a single-hidden-layer neural network proposed by Huang, G.-B., in 2006. As shown in Figure 1, its network structure consists of an input layer, a hidden layer, and an output layer. Compared with traditional BP neural networks and support vector machines, the weights between the hidden layer and the output layer are determined by solving a generalized inverse matrix. This eliminates the need to iteratively adjust connection weights, resulting in extremely fast computation.
Let x1 to xk represent the k-dimensional input of a sample and y1 to yn represent the n-dimensional target output. An ELM with L hidden-layer neurons can be expressed as:

$$\sum_{i=1}^{L} \beta_i\, G(\omega_i \cdot x_j + b_i) = y_j, \quad j = 1, 2, \ldots, N$$

where X is the input matrix, ω denotes the connection weights randomly generated between the input nodes and the hidden-layer nodes, bi is the bias of the i-th hidden-layer node, and G(x) is the activation function of the hidden layer. In matrix form, the equation can be expressed as:

$$H\beta = Y$$

where H is the feature mapping from the k-dimensional input space to the L-dimensional space, i.e., the hidden layer’s output matrix, and β is the weight vector from the hidden layer to the output layer. By applying least squares to minimize the loss function, the solution is obtained:

$$\beta = H^{\dagger} Y$$

where H† is the Moore–Penrose generalized inverse of H. The least squares solution can easily lead to overfitting, which limits the model’s generalization ability. Therefore, a regularization parameter C is introduced to reduce this risk. The output weights β are then obtained as:

$$\beta = H^{T}\left(\frac{I}{C} + H H^{T}\right)^{-1} Y$$
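A minimal numerical sketch of this closed-form training (assuming a sigmoid activation; all names are illustrative):

```python
import numpy as np

def train_elm(X, Y, L=50, C=1e3, rng=np.random.default_rng(0)):
    """Regularized ELM: random hidden layer, closed-form output weights."""
    k = X.shape[1]
    W = rng.standard_normal((k, L))           # random input-to-hidden weights
    b = rng.standard_normal(L)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # sigmoid hidden-layer outputs
    # beta = H^T (I/C + H H^T)^{-1} Y
    beta = H.T @ np.linalg.solve(np.eye(len(X)) / C + H @ H.T, Y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```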
Kernel functions are introduced to enhance the performance of the ELM, leading to the KELM algorithm [41]. The main difference between KELM and ELM is that the kernel function, through kernel mapping, increases the dimensionality of the data, mapping data that are linearly non-separable in a low-dimensional space to a higher-dimensional space where they become linearly separable. The kernel matrix can be defined as:

$$\Omega_{ELM} = H H^{T}, \qquad \Omega_{i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j)$$

The kernel matrix Ω_ELM based on the kernel function replaces HH^T, h(x) is the hidden-layer output for input x, and K(xi, xj) represents the kernel function. The output of the KELM can be expressed as:

$$f(x) = \left[K(x, x_1), \ldots, K(x, x_N)\right] \left(\frac{I}{C} + \Omega_{ELM}\right)^{-1} Y$$

Random mapping is replaced with kernel mapping, which eliminates the need to set the number of hidden-layer nodes during the initialization phase. Additionally, there is no need to set the hidden-layer weights and thresholds, effectively improving the generalization ability and stability that could otherwise be compromised by random initialization of the hidden-layer weights.
The KELM introduces kernel functions, enhancing ELM performance on nonlinear problems. However, the kernel function in KELM is a single, static function, which limits its ability to handle complex data structures and its adaptability to different data types and features [33,34,35,36]. Therefore, this paper introduces multiple kernel functions, which are combined linearly to form the final kernel function. The weight of each kernel function is optimized through an algorithm. This enables the model to adapt to data characteristics from multiple dimensions and perspectives, improving its learning and generalization abilities. As a result, the model demonstrates more stable performance across various scenarios.
This study selects the following kernel functions to construct the multiple kernel function: RBF (Radial Basis Function) kernel, linear kernel, and polynomial kernel. The linear kernel can capture the linearly separable parts of the data well without increasing the model’s complexity. The RBF kernel maps the data into an infinite-dimensional space, effectively handling linearly non-separable data structures in the original space. The polynomial kernel is capable of handling high-dimensional polynomial structures in the data. These kernels are combined, and their weights are optimized.
The parameters that need to be optimized in the MKELM include the width of the Gaussian kernel (σ), the degree of the polynomial kernel (d), the regularization parameter C, and the weights of the three kernel functions, namely c1, c2, and c3, where c1 + c2 + c3 = 1.
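A minimal sketch of the multi-kernel construction and the resulting KELM solve is given below (illustrative names; the linear combination K = c1·K_rbf + c2·K_lin + c3·K_poly with c1 + c2 + c3 = 1 follows the description above):

```python
import numpy as np

def kernel_matrix(A, B, sigma, d, c1, c2, c3):
    """Weighted sum of RBF, linear, and polynomial kernels (c1+c2+c3 = 1)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K_rbf = np.exp(-sq / (2.0 * sigma ** 2))
    K_lin = A @ B.T
    K_poly = (A @ B.T + 1.0) ** d
    return c1 * K_rbf + c2 * K_lin + c3 * K_poly

def train_mkelm(X, Y, C, sigma, d, c1, c2, c3):
    """Closed-form MKELM: alpha = (I/C + Omega)^-1 Y."""
    Omega = kernel_matrix(X, X, sigma, d, c1, c2, c3)
    return np.linalg.solve(np.eye(len(X)) / C + Omega, Y)

def predict_mkelm(X_new, X_train, alpha, sigma, d, c1, c2, c3):
    return kernel_matrix(X_new, X_train, sigma, d, c1, c2, c3) @ alpha
```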
2.4.3. Piecewise QPSO Improved by GA
The traditional PSO algorithm updates the velocity and position of the particles in the next generation based on the particle’s velocity inertia, the individual best, and the global best, which carries a significant risk of getting stuck in local optima. To address this issue, Jun Sun et al. introduced quantum theory into the Particle Swarm Optimization algorithm, resulting in the QPSO algorithm [27]. In the QPSO algorithm, the particle’s position update formula is as follows:

$$x_i(t+1) = p_i(t) \pm \alpha(t)\,\left|N_{best}(t) - x_i(t)\right|\,\ln\frac{1}{u}$$

where the sign is taken as positive or negative with equal probability, and p_i(t) is the local attraction point of the i-th particle, whose formula is as follows:

$$p_i(t) = \varphi\,P_i(t) + (1 - \varphi)\,P_g(t)$$

Pi(t) is the individual best position of the i-th particle at the t-th iteration, and Pg(t) is the global best position at the t-th iteration; α(t) is the shrinkage–expansion coefficient; Nbest(t) is the average position of all particles’ individual best positions at the t-th iteration; u and φ are random numbers uniformly generated within the range [0, 1]. The formula for calculating Nbest(t) is as follows:

$$N_{best}(t) = \frac{1}{M}\sum_{i=1}^{M} P_i(t)$$

where M is the number of particles. The formula for calculating α(t) is as follows:

$$\alpha(t) = 0.5 + 0.5\,\frac{T - t}{T}$$
where t is the current iteration number, and T is the maximum number of iterations. The local attraction point p_i(t) plays an important role in the particle position update; it is a linear combination of the individual and global best positions, with φ a random number in the range [0, 1]. Although the particle can thus integrate individual and population information to update its position, the search lacks emphasis on different stages. This paper therefore proposes a piecewise attraction point, which adjusts the relative dominance of the individual and global bests at different stages. The parameter φ is bounded by a maximum value φ_max instead of 1: when φ_max exceeds 0.5, φ is a random number in the range [0.5, φ_max]; otherwise, it is in the range [0, φ_max]. The formula for φ_max is as follows:

$$\varphi_{max}(t) = 1 - \frac{t}{T}$$

In the early stages of the algorithm, φ has a higher probability of taking larger values, so the individual best dominates the attraction point, improving the particles’ global search ability and avoiding premature convergence to local optima. In the later stages, the probability of φ taking smaller values increases, and the global best dominates the attraction point.
The piecewise attraction point enhances the global search ability of the particles. However, there is still a possibility of getting stuck in local optima during the convergence phase. To address this, the mutation idea of the genetic algorithm is introduced and a piecewise mutation strategy is proposed. A minimal mutation probability is applied in the early stages, when the particles are still relatively dispersed, ensuring that the search and convergence of the particles are not affected. In the later stages, when the particles have converged to a certain extent and the population has become highly similar, the mutation rate is increased. At this point, a higher mutation rate has little effect on the convergence trend of the population but greatly improves its ability to escape from local optima, further enhancing the model’s global search ability and stability and reducing the risk of getting trapped in local optima. The mutation rate λ is calculated as follows:

$$\lambda \sim \begin{cases} U\left[0,\; 0.1\,\dfrac{t}{T}\right], & 0.1\,\dfrac{t}{T} < 0.09 \\[6pt] U\left[0.09,\; 0.1\,\dfrac{t}{T}\right], & \text{otherwise} \end{cases}$$

That is, λ is a random number in the range [0, 0.1·t/T] when 0.1·t/T < 0.09; otherwise, λ is a random number in the range [0.09, 0.1·t/T].
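A compact sketch of one iteration of this piecewise QPSO follows (a minimal illustration of the update rules above, not the authors’ implementation; the mutation perturbation is an assumed Gaussian jump):

```python
import numpy as np

def qpso_step(X, P, Pg, t, T, rng=np.random.default_rng(0)):
    """One piecewise-QPSO update. X: positions (M, D); P: individual
    bests (M, D); Pg: global best (D,)."""
    M, D = X.shape
    alpha = 0.5 + 0.5 * (T - t) / T             # shrinkage-expansion coefficient
    N_best = P.mean(axis=0)                     # mean of individual bests
    phi_max = 1.0 - t / T                       # piecewise attraction bound
    lo = 0.5 if phi_max > 0.5 else 0.0
    phi = rng.uniform(lo, max(phi_max, lo + 1e-12), size=(M, 1))
    p = phi * P + (1.0 - phi) * Pg              # local attraction points
    u = rng.uniform(1e-12, 1.0, size=(M, D))
    sign = rng.choice([-1.0, 1.0], size=(M, D))
    X_new = p + sign * alpha * np.abs(N_best - X) * np.log(1.0 / u)
    # Piecewise GA-style mutation: rare early, more frequent late.
    cap = 0.1 * t / T
    lam = rng.uniform(0.0, cap) if cap < 0.09 else rng.uniform(0.09, cap)
    mutate = rng.random((M, D)) < lam
    X_new[mutate] += rng.standard_normal(mutate.sum())
    return X_new
```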
2.4.4. QPSO-MKELM Model and Kmeans-QPSO-MKELM Model
The QPSO-MKELM model is constructed by optimizing the hyperparameters of the MKELM using the piecewise QPSO improved by GA. These hyperparameters include the Gaussian kernel width (σ), the polynomial kernel degree (d), the regularization parameter C, and the weight parameters of the three kernel functions (c1, c2, c3).
The Kmeans-QPSO-MKELM model first divides the training set into k clusters using K-means clustering. The k clusters are then treated as k subsets of the original dataset. For each of the k subsets, a QPSO-MKELM sub-model is trained, and all k sub-models together form the final model. In this study, the Kmeans-QPSO-MKELM model adopts a weighted scheme during the testing phase. First, the Euclidean distance between each data point in the test set and the centroids of the k clusters is calculated. The two clusters whose centroids are closest to the test point are selected, and the corresponding sub-models are used to make predictions. The final prediction is then computed by weighted averaging, where each sub-model’s weight is proportional to 1/d⁶, with d the distance from the test point to that sub-model’s centroid. In this study, k was chosen as 10, 20, 30, and 40 for comparative experiments.
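A minimal sketch of this inference scheme (illustrative; `submodels[j]` stands for the QPSO-MKELM sub-model trained on cluster j):

```python
import numpy as np

def predict_weighted(x, centroids, submodels):
    """Predict with the two sub-models whose centroids are nearest to x,
    weighting each prediction by 1/d^6."""
    d = np.linalg.norm(centroids - x, axis=1)   # distances to all centroids
    nearest = np.argsort(d)[:2]                 # two closest clusters
    w = 1.0 / (d[nearest] ** 6 + 1e-12)         # inverse-distance weights
    preds = np.array([submodels[j](x) for j in nearest])
    return float((w * preds).sum() / w.sum())
```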
2.5. Hardware and Software Configuration & Program Execution Time
The machine learning models in this experiment were run on a computer with an Intel Core i5-1135G7 CPU, Intel Iris Xe Graphics GPU, and 16 GB RAM, using MATLAB R2023b software.
The model execution time reported in this study refers to the training time, averaged over five runs of each model.
2.6. Performance Comparison Criteria
This study uses R2, MAE, RMSE, and the Global Performance Index (Gp) [26,42,43,44] as evaluation metrics to assess the performance of the different models. The specific calculation methods and formulas for R2, MAE, and RMSE can be found in reference [45]. The specific calculation formula for Gp [26] is as follows:

$$G_p = \sum_{j=1}^{3} \alpha_j \left(T_j - M_j\right)$$

where Tj is the normalized value of each of the three evaluation metrics (R2, MAE, RMSE), and Mj is the median value of the corresponding normalized metric. When Tj corresponds to R2, αj is 1; in all other cases, αj is −1.
The higher the R2 value, the better the model’s fit to the data, indicating better performance. The lower the RMSE and MAE, the better the model’s performance. The higher the Gp value, the better the overall performance of the model. Gp-Rank ranks the models based on their Gp values, from the highest to the lowest.
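A minimal sketch of the Gp aggregation (assuming min-max normalization of each metric across the compared models, which is one common reading of “normalized value”):

```python
import numpy as np

def gp_scores(metrics):
    """metrics: dict model -> (r2, mae, rmse). Returns dict model -> Gp."""
    names = list(metrics)
    M = np.array([metrics[m] for m in names])           # rows: models
    # Min-max normalize each metric column across models.
    T = (M - M.min(0)) / (M.max(0) - M.min(0) + 1e-12)
    med = np.median(T, axis=0)                          # median per metric
    alpha = np.array([1.0, -1.0, -1.0])                 # +1 for R2, -1 otherwise
    gp = (alpha * (T - med)).sum(axis=1)
    return dict(zip(names, gp))
```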
4. Discussion
This study first estimated the importance of meteorological features for predicting ET0 based on the correlation between each meteorological feature and ET0 and the correlations among the meteorological features themselves. The normalized variance values of the features were used to assess their significance, and important features were selected while redundant ones were excluded. The study identified temperature and sunshine duration as the two most important features for estimating evapotranspiration. All models achieved over 95% prediction accuracy when only temperature and sunshine duration were used as inputs. This is consistent with the findings of Zhao et al. (2023), who predicted evapotranspiration for 14 sites in southern China [48]. This indicates that estimating ET0 from only temperature and sunshine duration is reasonable in southern China under limited meteorological conditions. This finding provides valuable guidance for sensor selection and data collection in practical applications that require ET0 prediction.
Considering that evapotranspiration varies regularly and exhibits significant periodicity in the study region, the sine and cosine values of the date were added as inputs, resulting in a noticeable improvement in prediction accuracy across all models. Similarly, Hu et al. (2024) incorporated extraterrestrial radiation (Ra), a parameter that can be calculated from the date and latitude, into their model and observed a significant improvement in prediction performance [49]. Building upon the widely recommended PSO-ELM model for ET0 prediction, this study proposed the Kmeans-QPSO-MKELM algorithm, which addresses the shortcomings of the original model. The new model improves prediction accuracy and achieves a shorter training time. At the study site, the new model outperformed the WOA-SVR model proposed by Mohammadi et al. (2020) for ET0 prediction [24], achieving higher prediction accuracy and a shorter training time. The prediction accuracy of the new model was also higher than that of the traditional ANFIS and RF models.
At the same time, this study has several limitations and areas for improvement:
(1) The Kmeans-QPSO-MKELM model used linear, polynomial, and RBF kernel functions. Future research could compare the impact of different kernel functions on the model’s ability to capture meteorological features and predict ET0; given that ET0 exhibits periodicity, periodic kernel functions could be included.
(2) This study was conducted at only one location within a single climatic zone. Future studies should expand to different climatic zones to assess the model’s performance under varied conditions.
(3) This study used the FAO-56 model to calculate the target ET0 values, which is common practice. However, the conclusions drawn may not be fully robust; future research should use experimentally measured ET0 as the target value for more reliable experiments.
(4) The models compared in this study were limited. Future work could include comparisons with other models, such as deep learning models, to assess their performance.