Article

Improving Solar Radiation Prediction in China: A Stacking Model Approach with Categorical Boosting Feature Selection

1 College of Horticulture and Plant Protection, Henan University of Science and Technology, Luoyang 471000, China
2 School of Energy Science and Engineering, Harbin Institute of Technology, 92, West Dazhi Street, Harbin 150001, China
3 College of Agricultural Equipment Engineering, Henan University of Science and Technology, Luoyang 471000, China
4 Key Laboratory for Agricultural Soil and Water Engineering in Arid Area of Ministry of Education, Northwest A&F University, Xianyang 712100, China
* Authors to whom correspondence should be addressed.
Atmosphere 2024, 15(12), 1436; https://doi.org/10.3390/atmos15121436
Submission received: 30 October 2024 / Revised: 26 November 2024 / Accepted: 28 November 2024 / Published: 29 November 2024
(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)

Abstract

Solar radiation is an important energy source, and accurately predicting daily global and diffuse solar radiation (Rs and Rd) is essential for research on surface energy exchange, hydrologic systems, and agricultural production. However, Rs and Rd estimation relies on meteorological data and related model parameters, which leads to inaccuracy in some regions. To improve the estimation accuracy and generalization ability of Rs and Rd models, 17 representative radiation stations in China were selected, and the categorical boosting (CatBoost) feature selection algorithm was utilized to construct a novel stacking model from the perspectives of sample and parameter diversity. The results revealed that features related to sunshine duration (n) and ozone (O3) significantly affect solar radiation prediction. The proposed ensemble model framework outperformed the base models in root mean square error (RMSE), coefficient of determination (R2), mean absolute error (MAE), and global performance index (GPI). The solar radiation prediction model is more applicable to coastal areas, such as Shanghai and Guangzhou, than to inland regions of China. The ranges and means of RMSE, MAE, and R2 for Rs prediction are 1.5737–3.7482 (1.9318), 1.1773–2.6814 (1.4336), and 0.7597–0.9655 (0.9226), respectively; for Rd prediction, they are 1.2589–2.9038 (1.8201), 0.9811–2.1024 (1.3493), and 0.5153–0.9217 (0.7248), respectively. The results of this study can provide a reference for Rs and Rd estimation and related applications in China.

Graphical Abstract

1. Introduction

Solar radiation is an important energy source for all activities on Earth [1,2]. It is critical for exchanging energy on the Earth’s surface and influences weather and climate [3,4]. Daily global and diffuse solar radiation (Rs and Rd) are vital metrics for assessing solar radiation and are critical in designing and optimizing solar energy systems [5,6,7].
As a result of the high cost of installing and maintaining solar radiation instruments [8,9], obtaining reliable Rs and Rd data is difficult [10], and only a few meteorological stations worldwide record solar radiation data [11,12]. Therefore, scholars have proposed various methods to generate high-quality Rs and Rd data [13,14], such as empirical and machine learning (ML) models [15]. Common empirical models use sunshine duration (n) and temperature data to build estimation models [16]. For example, Bailek et al. [17] tested the monthly diffuse radiation performance of 35 empirical models based on temperature and n in the Sahara region and found that a quadratic equation model proposed for surface applications was the most accurate. Souza et al. [18] predicted solar radiation in Brazil using the Ångström–Prescott empirical model with sunshine-hour data and recommended this model for areas where n data are available. By comparing empirical models that use n, temperature, and other meteorological data, Uçkan et al. [19] determined that models considering additional meteorological factors are more accurate and globally applicable. However, empirical models have difficulty capturing the nonlinear, multidimensional relationship between solar radiation and input features in noisy environments [20]. Compared with empirical models, ML models are more accurate and better suited to nonlinear problems [21], and they have accordingly been widely used to predict solar radiation. Hassan et al. [22] used MLP, ANFIS, and SVM algorithms to predict solar radiation in Cairo, Egypt, and found that the MLP model significantly improved the predictions. Feng et al. [23] compared daily global solar radiation predictions from four ML models and four temperature-based empirical models in China and showed that a hybrid mind evolutionary algorithm and ANN model was more accurate than existing ML and empirical models. Although ML algorithms yield more accurate predictions, the learning performance of a single model is limited by randomness, which leads to poor generalization. Given this limitation, scholars often use ensemble models to predict solar radiation [24]. Dong et al. [25] evaluated the potential of three ML algorithms [support vector machine regression (SVR), Extreme Gradient Boosting (XGBoost), and multivariate adaptive regression splines (MARS)] for estimating Rd using conventional meteorological data from five Chinese weather stations as input; the ensemble learning algorithm demonstrated superior performance and stability. Lee et al. [26] proposed four ensemble-learning-based solar radiation prediction models (boosted trees, bagged trees, random forest, and generalized random forest) and compared them with SVR and Gaussian process regression; the ensemble learning methods performed well. Ensemble learning surpasses the accuracy of a single model by combining multiple homogeneous or heterogeneous learners into one model [27,28]. However, among the three ensemble learning strategies (bagging, boosting, and stacking), only stacking can combine different types of learners, and it has low data requirements [29]. By integrating the results of varying base learners, the variance can be reduced and the stability of the model can be improved [30].
Temperature and n are widely used to build radiation prediction models because of their representativeness [23]. In addition, other environmental factors and geographic information, such as relative humidity, precipitation, longitude, and latitude, are considered [31]. Radiation attenuation has recently been observed in many parts of the world, primarily because of increasing atmospheric pollution [32,33]. For instance, particulate matter (PM2.5 and PM10), carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), and O3 attenuate radiation by scattering and absorbing it [34]. Moreover, conventional meteorological factors alone may not be good predictors of solar radiation. Accordingly, the predictive influence of air pollutants on solar radiation has attracted the attention of many scholars in recent years [35]. Sun et al. [36] established different RF models that consider the impact of the air pollution index and found that introducing air pollution data as input can improve RF performance, reducing RMSE values by 2.0–17.4%. However, increasing the number of variables often introduces information redundancy and the curse of dimensionality, thus reducing model performance [37]. To boost the estimation accuracy of ML, it is important to select model input factors appropriately [38]. For example, using three feature selection methods (recursive feature elimination, variable selection using random forests, and the least absolute shrinkage and selection operator (LASSO)), Luo et al. [39] developed a model for estimating forest biomass from various features. The utility of ML models also depends on their transparency, that is, on understanding how input features drive predictions. When testing black box models, potential errors inherited from the training data may lead to significant biases in the results, diminishing confidence in complex black box models [40]. Although many global model interpretation methods exist, such as the mean feature importance output by decision tree-based models, they cannot characterize the effect of individual features on a sample. Therefore, methods for local model interpretation have recently been proposed, such as local interpretable model-agnostic explanations [41] and Shapley additive explanations (SHAPs) [42]. For example, Ding et al. [43] used SHAP to determine the key parameters affecting compost maturity prediction, and Mitrentsis et al. [40] interpreted PV prediction models by calculating SHAP values.
This study aims to build predictive and transparent ML models to improve both the accuracy of solar radiation prediction and the interpretability of the models. Seventeen regions were selected from different climatic zones in China, and the CatBoost feature selection algorithm was used to output the importance of each feature and determine the model input combinations at different stations. Five ML models with low mutual correlation and good performance were selected to construct the stacking model framework. SHAP values were used to explain the influence of different features on the stacking model and to evaluate the applicability of the model in China. Although many scholars have employed ML models to predict solar radiation, to the best of our knowledge, none has combined the CatBoost feature selection algorithm and SHAP values to construct a novel solar radiation prediction model. This approach offers significant advantages in predictive performance and interpretability, thereby making a valuable contribution to the research and application of solar energy forecasting.

2. Materials and Methods

2.1. Data Collection and Processing

The following data were collected from 2015 to 2020 at 17 representative regions in China (Figure 1, Table 1): Rs, Rd, n, mean temperature (Tmean), maximum temperature (Tmax), minimum temperature (Tmin), air pressure (Pr), relative humidity (Rh), precipitation (Pt), wind speed (Ws), and the mass concentrations of major air pollutants (PM2.5, PM10, CO, SO2, NO2, and O3), as well as the calculated air quality index (AQI). Meteorological and radiation data were obtained from the China Meteorological Data Network (https://data.cma.cn/, accessed on 1 October 2024). Air pollutant data were derived from the China National Environmental Monitoring Center (http://www.cnemc.cn/, accessed on 1 October 2024). In addition, the values of extra-terrestrial solar radiation (Ra), maximum sunshine duration (N), and vapor pressure deficit (Vpd) were determined according to the methods described by Allen et al. [44]. Incomplete and abnormal records were removed from the data set; notably, records in which Rd/Rs, Rs/Ra, or n/N was greater than 1 were discarded [45], as sketched below.
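As a rough illustration of this quality-control step, the filtering might be expressed as follows; the file name and column names (Rs, Rd, Ra, n, N) are hypothetical stand-ins for the actual station data layout, which is not given here:

```python
import pandas as pd

# Hypothetical station file; columns Rs, Rd, Ra, n, N assumed present.
df = pd.read_csv("station_daily.csv")

# Remove incomplete records, then drop physically implausible days on
# which Rd/Rs, Rs/Ra, or n/N exceeds 1.
df = df.dropna()
valid = (
    (df["Rd"] / df["Rs"] <= 1)
    & (df["Rs"] / df["Ra"] <= 1)
    & (df["n"] / df["N"] <= 1)
)
df = df[valid]
```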

2.2. Evaluation of Model Input Characteristics

2.2.1. CatBoost Feature Selection Algorithm

Feature selection, the removal of redundant variables to improve generalization ability and reduce model complexity, is widely applied in ML. In this study, the tree model CatBoost is used for feature selection by calculating feature importance. In previous studies, the feature threshold was set manually; because such a threshold is determined primarily by the user, unreasonable choices may arise. To solve this problem, we propose first ranking the feature importance values from high to low, obtaining their median, and then taking the median as the threshold for removing redundant features. CatBoost replaces the traditional gradient boosting algorithm with ordered boosting and permutation-driven training to reduce the bias of gradient estimation and enhance generalization ability. As an ensemble algorithm with decision trees as the basic learners, CatBoost is expressed as follows [46]:
$$G_N = \sum_{n=1}^{N} g_n$$
where $G_N$ is the strong ensemble learner and $g_n$ is a decision tree fitted to the residuals of the preceding trees.
Because CatBoost trains each tree on the predictions of the previous trees, it provides distinctive ways of calculating feature importance. Two such metrics are the prediction values change (PVC) and the loss function change (LFC): PVC describes the mean change in the predicted value when the feature value changes, and LFC describes the change in the loss function when the model is trained with and without the feature. The greater the PVC, or the more significant the LFC, the more important the feature.
$$PVC_F = \sum_{trees,\, leafs_F} \left[ (\nu_1 - avr)^2 \cdot leaf_l + (\nu_2 - avr)^2 \cdot leaf_r \right]$$
$$avr = \frac{\nu_1 \cdot leaf_l + \nu_2 \cdot leaf_r}{leaf_l + leaf_r}$$
$$LFC_j = l_{ex_j} - l_{features}$$
where $leaf_l$ and $leaf_r$ represent the weights of the left and right leaves, respectively; $\nu_1$ and $\nu_2$ represent the objective function values of the left and right leaves, respectively; and $l_{ex_j}$ and $l_{features}$ represent the loss function value of the model without the given feature and with all features, respectively.
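A minimal sketch of the median-threshold screening described above, assuming X is a pandas DataFrame of candidate predictors and y the corresponding Rs or Rd series (the hyperparameters shown are illustrative, not the tuned values):

```python
import numpy as np
from catboost import CatBoostRegressor

# Fit CatBoost on one station's data; settings are illustrative only.
model = CatBoostRegressor(iterations=500, verbose=False)
model.fit(X, y)

# PredictionValuesChange importances (CatBoost's default for a trained model).
importance = np.asarray(model.get_feature_importance())

# Rank features from high to low and keep those at or above the median,
# following the thresholding rule proposed in the text.
threshold = np.median(importance)
ranked = sorted(zip(X.columns, importance), key=lambda t: t[1], reverse=True)
selected = [name for name, imp in ranked if imp >= threshold]
```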

2.2.2. Shapley Additive Explanation

The SHAP implementation in this study is based on the KernelExplainer principle, in which weighted linear regression is used to approximate the SHAP values of any model. The SHAP value technique is a model interpretation method based on game theory and local interpretation, which can reflect the influence of the features in each sample on the output. In this study, the SHAP value technique computes the mean marginal contribution of each feature to the model output over different feature orderings to generate the SHAP reference value, which obeys the following formula [47]:
$$y_i = y_{base} + f(x_{i1}) + f(x_{i2}) + \dots + f(x_{ij})$$
where $x_{ij}$ is the jth feature of the ith sample, $y_{base}$ is the baseline of the whole model, and $y_i$ is the model's predicted value for the sample. $f(x_{ij})$ is then the SHAP value of $x_{ij}$: $f(x_{ij}) > 0$ denotes that the feature contributes positively to the output; otherwise, it contributes negatively.
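A sketch of this workflow with the shap library, assuming `stack` is the trained stacking model and X_train and X_test are available; a small subsample of the training data serves as the background distribution:

```python
import shap

# KernelExplainer approximates SHAP values for an arbitrary model via
# weighted linear regression against a background sample.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(stack.predict, background)
shap_values = explainer.shap_values(X_test)

# Global summary: mean absolute SHAP value per feature.
shap.summary_plot(shap_values, X_test, plot_type="bar")
```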

2.3. Learner Selection for Stacking

In stacking ensemble learning, the selection of learners is essential. The base learners should be models with stable performance, to improve the overall performance of the ensemble, and the differences among them should be as large as possible. Different models essentially observe the data in different data spaces and structures, depending on the principles underlying the algorithms themselves. Selecting learners that differ considerably therefore ensures that the advantages of different algorithms are leveraged and that each model compensates for the weaknesses of the others. In this study, the Pearson correlation coefficient [48] was used to quantify the degree of difference between models so that models with large differences could be selected (a sketch of this check is given below). Figure 2 shows that the correlations among the algorithms are generally high because all of them have strong learning ability. The linear models (linear, ridge, and Bayesian) have the highest mutual correlation, followed by the tree-based ensemble algorithms (RF, XGBoost, GBDT, LightGBM, and CatBoost), mainly because algorithms of the same class observe the data in similar ways, whereas models based on different principles, whose training mechanisms differ greatly, are only weakly correlated. Considering both performance and correlation, we chose the Bayesian model among the linear approaches, XGBoost among the ensemble approaches, and three models with different operating principles (SVR, ANN, and KNN) as the base learners. The meta-learner mainly requires good generalization ability and robustness to overfitting, so a relatively simple linear model is generally chosen. Elastic network regression constrains the coefficients by introducing L1 and L2 regularization, which screens the base learners and assigns them different weights according to their coefficients; this improves the generalization ability of the model and combines the advantages of multiple models. The principles of the learners used to build the stacking model are presented in the following sections.
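The correlation screening might look as follows; `candidates` is a hypothetical dict of already-fitted models and X_val a validation input matrix, neither of which is specified in the text:

```python
import pandas as pd

# Pairwise Pearson correlation of the candidate learners' predictions
# on a held-out validation set (models assumed already trained).
preds = pd.DataFrame(
    {name: model.predict(X_val) for name, model in candidates.items()}
)
corr = preds.corr(method="pearson")

# Pairs with low off-diagonal correlation are the diverse learners
# worth combining in the stacking ensemble.
print(corr.round(3))
```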

2.3.1. Support Vector Regression (SVR)

SVR is an effective model for regression problems that has become popular in recent years; it improves the generalization ability of ML and reduces overfitting by minimizing the structural risk [49]. Whereas support vector classification separates samples with different labels by searching for a dividing hyperplane in the training set, in SVR only one class of sample points exists, and the optimal hyperplane is the one that minimizes the total deviation of all sample points from it. The SVR model in this study uses a linear kernel function.

2.3.2. Artificial Neural Networks (ANNs)

ANNs are models composed of many interconnected neurons and are widely used to analyze various complex problems [50]. In addition to the input and output layers, a model may contain multiple hidden layers [51]; a simple ANN has only one hidden layer, giving it a three-layer structure. In this study, the layers are fully connected, each neuron uses a rectified linear unit activation function, and the weighted sum of one layer's outputs forms the input of the next layer. Detailed computation procedures and information on ANNs can be found in the work of Xin [52].

2.3.3. K-Nearest Neighbor (KNN)

KNN is a simple regression method whose main idea is to compute the distances between an unknown sample and the known samples, select the k nearest neighbors in the feature space, assign each neighbor a weight determined by its distance to the sample, and take the weighted mean of the k neighbors as the prediction. KNN uses the Euclidean distance, sorts the training samples by distance, and can compute distances in any space. For more details about the KNN model, refer to Nguyen et al. [53].
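A minimal sketch of this distance-weighted scheme, assuming scikit-learn's KNeighborsRegressor stands in for the KNN described here and X_train, y_train, and X_test are available (k = 5 is illustrative):

```python
from sklearn.neighbors import KNeighborsRegressor

# Prediction is the weighted mean of the k nearest training samples,
# with weights inversely proportional to Euclidean distance.
knn = KNeighborsRegressor(n_neighbors=5, weights="distance", metric="euclidean")
knn.fit(X_train, y_train)
y_hat = knn.predict(X_test)
```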

2.3.4. Bayesian Ridge Regression (Bayesian)

Bayesian regression is a statistical linear regression model solved by Bayesian inference; notably, it can alleviate the overfitting that arises in maximum likelihood estimation. Ridge regression, on the other hand, applies an L2-regularized penalty term to the regression model, shrinking the regression coefficients to obtain reliable estimates; it is essentially a modified least squares method. Bayesian ridge regression combines the advantages of Bayesian linear regression and ridge regression and makes full use of the sample data, so it can accurately determine model complexity using only the training samples. Further information and details on the computational procedure of the Bayesian algorithm can be found in the work of Saqib [54].

2.3.5. Extreme Gradient Boosting (XGBoost)

XGBoost is a tree-based ensemble algorithm. In each iteration, a new tree is fitted to the residual between the predictions of the existing trees and the true values of the training samples, and the final prediction is obtained by accumulating the predictions of all tree models. The objective function adds a regularization term and is approximated by a second-order Taylor expansion, which optimizes the loss function, simplifies the model, and avoids overfitting. Further information on the XGBoost model can be found in the work of Chen et al. [55].

2.3.6. Elastic Network Regression (ElasticNet)

An overfitting risk is inherent in multivariate linear regression models [56] when the least squares method is used to determine the unknown parameters. Lasso therefore applies an L1-regularized penalty term that can shrink regression coefficients to zero, removing independent variables and screening out key variables. Ridge regression, in contrast, applies an L2-regularized penalty term that shrinks the regression coefficients of correlated variables simultaneously. ElasticNet uses both the L1 and L2 norms as prior regularization terms in the linear regression model, so it can both screen variables and shrink correlated ones.
Take a sample $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$. The cost function of the elastic network regression algorithm is as follows:
$$Cost(w) = \sum_{i=1}^{N} \left( y_i - w^T x_i \right)^2 + \lambda \rho \left\| w \right\|_1 + \frac{\lambda (1 - \rho)}{2} \left\| w \right\|_2^2$$
where the parameters $\lambda$ and $\rho$ control the size of the penalty terms.
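In scikit-learn's ElasticNet, the parameter `alpha` plays roughly the role of $\lambda$ and `l1_ratio` the role of $\rho$ in the cost function above (up to the library's internal scaling conventions); a brief sketch, with illustrative rather than tuned values:

```python
from sklearn.linear_model import ElasticNet

# alpha ~ λ, l1_ratio ~ ρ (approximately; scaling conventions differ).
meta = ElasticNet(alpha=0.1, l1_ratio=0.5)
meta.fit(Z_train, y_train)  # Z_train: out-of-fold base-learner predictions
```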

2.3.7. Stacking

Stacking is an ensemble approach for enhancing predictive potential and adjusting the bias-variance trade-off of base learners [57,58]. The first layer in stacking ensemble regression consists of the base models, different types of basic learners trained on the original data set; k-fold cross-validation is performed for each regression model to avoid overfitting. In the second layer, the k training predictions of the base models become the features of the meta-learner's training data, while the labels of the new samples remain those of the original data. Finally, the new model is used for training and prediction. The meta-learner can generalize over and correct errors in the output of the first layer, thus improving model accuracy [59]. Figure 3 shows the stacking model.
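A minimal sketch of this two-layer framework using scikit-learn's StackingRegressor; X_train, y_train, and X_test are assumed available, and all hyperparameters shown are illustrative placeholders rather than the tuned values used in the study:

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor

# Five diverse base learners (first layer) and an elastic-net meta-learner
# (second layer), mirroring the learner selection described above.
base_learners = [
    ("bayesian", BayesianRidge()),
    ("xgb", XGBRegressor(n_estimators=300)),
    ("svr", SVR(kernel="linear")),
    ("ann", MLPRegressor(hidden_layer_sizes=(64,), activation="relu", max_iter=1000)),
    ("knn", KNeighborsRegressor(n_neighbors=5, weights="distance")),
]
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=ElasticNet(alpha=0.1, l1_ratio=0.5),
    cv=5,  # k-fold CV produces out-of-fold meta-features, guarding against overfitting
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
```

StackingRegressor handles the out-of-fold bookkeeping internally: each base learner's k-fold predictions become the meta-learner's training features, exactly as described for the second-layer model.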

2.4. Performance Evaluation

Evaluation of the model is based on the RMSE, R2, MAE, and GPI [60].
$$R^2 = \frac{\left[ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) \right]^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - X_i)^2}$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - X_i \right|$$
$$GPI = \sum_{j=1}^{3} \alpha_j \left( T_j - \tilde{T}_j \right)$$
where $X_i$ and $Y_i$ are the estimated and measured values, respectively, and $\bar{X}$ and $\bar{Y}$ are the mean estimated and measured values, respectively. $T_j$ is the normalized value of the RMSE, MAE, or R2, and $\tilde{T}_j$ is its median. When $T_j$ is the normalized RMSE or MAE, $\alpha_j$ is −1; when $T_j$ is the normalized R2, $\alpha_j$ is 1. The model is more accurate when R2 is close to 1, and the lower the MAE and RMSE values, the lower the model error. The GPI has been widely used to rank overall model performance: the greater the GPI, the better the overall prediction of the model.
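As a sketch, the four metrics can be combined into the GPI as follows; the normalization is assumed here to be min-max scaling across the compared models, which the text does not specify:

```python
import numpy as np

def gpi(rmse, mae, r2):
    """GPI over the models being compared. Each metric vector is min-max
    normalized; signed deviations from the metric's median are summed with
    alpha = -1 for RMSE and MAE and alpha = +1 for R2, per the formula above."""
    scores = np.zeros(len(rmse))
    for values, alpha in ((np.asarray(rmse), -1.0),
                          (np.asarray(mae), -1.0),
                          (np.asarray(r2), 1.0)):
        scaled = (values - values.min()) / (values.max() - values.min())
        scores += alpha * (scaled - np.median(scaled))
    return scores
```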

3. Results and Discussion

3.1. Selection Results of the CatBoost Feature Selection Algorithm

The results of feature selection using CatBoost to predict Rs and Rd are shown in Figure 4. The figure shows that n is the most critical factor affecting Rs; the mean and range of its importance are 0.47 (0.17–0.62). The Rd feature selection results show that n is also the most important factor influencing Rd: it ranks first in all regions except Mohe and Shenyang, where it ranks second, and the mean and range of its importance are 0.28 (0.08–0.69). N is the primary influencing factor of Rd in Mohe and Shenyang, with feature importance values of 0.18 and 0.13, respectively. Apart from n, N and Ra ranked higher than the other factors in the feature importance calculations for Rs and Rd: their means and ranges of importance were 0.10 (0.04–0.21) and 0.08 (0.03–0.17) for Rs, and 0.13 (0.03–0.17) and 0.10 (0.03–0.15) for Rd, respectively. This result is attributable to n being the most direct factor affecting radiation, whereas N and Ra are closely related to n. Although temperature-related characteristics can indirectly reflect cloud cover, the relationship between radiation and temperature is weaker than that between radiation and n [61,62]. The feature importance results for Rs and Rd also reveal that O3 performs better than the other air pollution and meteorological factors (except n, N, and Ra); the means and ranges of its importance are 0.06 (0.01–0.15) for Rs and 0.05 (0.02–0.11) for Rd. This is because Rd is solar radiation scattered by airborne substances, such as aerosols and air molecules, together with diffuse reflection from the surface, and O3 scatters and absorbs solar radiation, thus reducing the amount reaching the Earth's surface. Therefore, introducing air pollution data is important for solar radiation prediction, which is consistent with previous studies. For example, Mohammadi et al. [63] evaluated the influence of nine characteristic factors on daily radiation prediction in Iran using the ANFIS model and showed that n, N, and Ra are the most influential input parameters for predicting Rs and Rd, and Fan et al. [64] confirmed that air pollution data improve Rd prediction accuracy. The specifics of the feature ranking differ slightly among stations, presumably because of geographical and climatic differences among the stations.
The greater the importance of an input variable, the higher the correlation between that variable and the predicted output, and removing less important features reduces redundancy [65]. Finally, the median of the feature importance values was selected as the threshold, and feature screening was conducted for Rs and Rd to construct the Rs and Rd prediction models. The screening results are listed in Table A1.

3.2. Shapley Additive Explanation (SHAP) Analysis

Traditional feature importance interpretation methods only explain the importance of features without elucidating how those features affect the prediction results. Therefore, this study introduced SHAP technology as a supplement to the explanatory analysis of the constructed stacking model. SHAP reflects not only the degree to which each feature in each sample influences the results but also how each feature affects the results, including the sign of the influence [47]. Beijing is taken as the representative region; other regions are illustrated in Figure A1, Figure A2, Figure A3 and Figure A4. Figure 5 shows the mean absolute SHAP values of the stacking model. When the stacking model predicts Rs, n has the highest mean SHAP value (3.54), followed by Ra (2.86), whereas Tmax has a low mean SHAP value of 0.05. For predicting Rd, N exhibits the highest mean SHAP value (1.39), followed by n (1.17), whereas SO2 has a low mean SHAP value of 0.04.
The global interpretation in Figure 6 reveals that when the stacking model predicts Rs, the first 14 features (n, Ra, Tmin, Vpd, Tmean, O3, N, NO2, Rh, PM10, AQI, Pt, SO2, and PM2.5) have a dominant influence on radiation. The last four features (Pr, CO, Ws, and Tmax) are mostly concentrated near a SHAP value of 0, indicating that they have little influence on radiation prediction on most days. When the stacking model predicts Rd, the first 13 features (N, n, Ra, O3, PM2.5, Pr, AQI, Rh, Pt, Vpd, PM10, Tmax, and NO2) have a significant influence on radiation, and the last five features (Tmin, Tmean, CO, Ws, and SO2) are concentrated around a SHAP value of 0. The most important features in the SHAP ranking are those related to sunshine duration (n, N, and Ra), and the larger their values, the larger the corresponding SHAP values; that is, the larger these feature values are, the larger the predicted solar radiation. The absolute SHAP values of the temperature features are higher when predicting Rs than when predicting Rd, indicating that the temperature feature set contributes more to Rs prediction than to Rd prediction. The rainfall feature also has a considerable effect on radiation prediction on some days, and the smaller its value, the smaller the predicted solar radiation, which is consistent with the intuition that rainfall is closely related to cloud cover and that clouds diminish solar radiation. Several studies have reached the same conclusion, namely that rainfall has a definite effect on radiation estimates [62,66,67]. Regarding the air pollution characteristics, a larger O3 value has a greater influence on radiation prediction. Introducing O3 can improve the prediction accuracy of solar radiation because the ozone layer absorbs a large amount of radiation [45].

3.3. Performance of Different ML Models in Radiation Estimation

Seventeen radiation stations in different regions of China were considered. After the CatBoost feature selection algorithm was applied, the selected features were input into the different models; the results of these models in the testing phase are shown in Table A2. Figure 7, Figure 8, and Table A2 reveal that XGBoost and KNN performed better than the other base learners. The ranges and means of the RMSE, MAE, R2, and GPI of the XGBoost model in predicting Rs were 1.6322–3.9516 (2.0580), 1.2141–2.8782 (1.5218), 0.7330–0.9564 (0.9124), and from −0.7450 to 0.9210 (0.6326), respectively; those of the KNN model were 1.7749–4.1885 (2.2043), 1.3251–2.8946 (1.6244), 0.7000–0.9575 (0.8988), and from −0.9434 to 0.8271 (0.5165), respectively. For predicting Rd, the ranges and means of the RMSE, MAE, R2, and GPI of XGBoost were 1.2996–3.1848 (1.9300), 1.0189–2.2588 (1.4250), 0.4551–0.9058 (0.6932), and 0.0665–0.7855 (0.4659), respectively; those of KNN were 1.4109–3.1970 (1.9891), 1.1182–2.4330 (1.4688), 0.4216–0.9051 (0.6753), and from −0.0161 to 0.6746 (0.4186), respectively. Because XGBoost is an ensemble learning method, it upgrades weak learners into a stronger learner by integrating multiple base models, and its high training speed and strong generalization ability make it a popular choice [68]. XGBoost expands the loss function to the second order of its Taylor series and uses the first- and second-order derivatives for updating and iterating during optimization, so the model makes full use of the data. Selecting base learners with good performance therefore improves the overall performance of the ensemble model [61,69]. The stacking model had the highest accuracy for predicting Rs: the ranges and means of its RMSE, MAE, R2, and GPI were 1.5737–3.7482 (1.9318), 1.1773–2.6814 (1.4336), 0.7597–0.9655 (0.9226), and from −0.5614 to 0.9542 (0.7122), respectively. Compared with the base learners, stacking reduced the mean RMSE and MAE by 6.14–25.25% and 5.79–26.48%, respectively, and improved the mean R2 by 1.12–7.09%. The stacking model also achieved the highest accuracy for predicting Rd: the ranges and means of its RMSE, MAE, R2, and GPI were 1.2589–2.9038 (1.8201), 0.9811–2.1024 (1.3493), 0.5153–0.9217 (0.7248), and 0.1877–0.8135 (0.5472), respectively. Compared with the base learners, stacking reduced the mean RMSE and MAE by 5.70–23.27% and 5.31–24.30%, respectively, and improved the mean R2 by 4.56–36.24%, demonstrating that stacking substantially improves accuracy and that the ensemble learning model construction is effective.
Figure 9 and Figure 10 demonstrate that the data points of the stacking model lie closer to the 1:1 line in the scatter plots, indicating its higher accuracy; see Figure A5 and Figure A6 for scatter plots of the other regions. The stacking model results for Rs and Rd prediction are represented by box plots in Figure 11 and Figure 12, respectively, with the 50th percentile (P50) used as the evaluation benchmark; the better the P50, the better the mean performance of the whole model. The P50 values of the RMSE in predicting Rs for the Bayesian, KNN, ANN, SVR, XGBoost, and stacking models at the 17 stations were 2.1074, 2.0503, 2.4454, 2.2331, 1.9013, and 1.7402 MJ m−2 d−1, respectively; the P50 values of the MAE were 1.5825, 1.5531, 1.8388, 1.5891, 1.4095, and 1.2810 MJ m−2 d−1; the P50 values of R2 were 0.9138, 0.9225, 0.8869, 0.9080, 0.9287, and 0.9421; and the P50 values of the GPI were 0.5922, 0.6669, 0.3406, 0.5443, 0.7435, and 0.7989, respectively. Additionally, the P50 values of the RMSE in predicting Rd for the Bayesian, KNN, ANN, SVR, XGBoost, and stacking models were 2.3032, 1.8867, 2.0640, 2.3815, 1.7395, and 1.6492 MJ m−2 d−1; the P50 values of the MAE were 1.7705, 1.3968, 1.5253, 1.8120, 1.3221, and 1.2413 MJ m−2 d−1; the P50 values of R2 were 0.5533, 0.6951, 0.5996, 0.5394, 0.7118, and 0.7567; and the P50 values of the GPI were 0.1297, 0.4480, 0.2476, 0.0884, 0.5231, and 0.5988, respectively. Ranked by performance, the base learners are ordered XGBoost > KNN > ANN > Bayesian > SVR. The Taylor diagrams in Figure 13 and Figure 14 visualize the performance of the different models in estimating Rs and Rd in Beijing; Taylor diagrams for the other regions are given in Figure A7 and Figure A8. The stacking model is clearly closer to the observation point and performs better than the other models. These results confirm that the proposed stacking ensemble model achieves satisfactory accuracy in predicting Rs and Rd. By choosing models with diverse principles and low correlation, the data can be observed in different spaces and structures and the corresponding models constructed according to each algorithm's own principles [43]. Selecting algorithms with divergent principles makes the models complementary, utilizing their strengths and overcoming their weaknesses [29].

3.4. Solar Radiation Performance of the Stacking Model in Different Regions

Figure 8 and Table A2 reveal that when the stacking model predicts Rs, the precision is highest in coastal areas, such as Shanghai and Guangzhou, with the exception of Sanya. The ranges of the RMSE, MAE, R2, and GPI there are 1.5737–1.6160, 1.1773–1.2529, 0.9375–0.9530, and 0.8980–0.9542, respectively, whereas the values in Sanya are 1.8783, 1.4672, 0.8989, and 0.6438, respectively. Among the inland areas, the results were best in Wenjiang, where the RMSE, MAE, R2, and GPI were 1.5830, 1.2111, 0.9498, and 0.9394, respectively; the worst accuracy was observed in Shenyang, where they were 3.7482, 2.6814, 0.7597, and −0.5614, respectively. Figure 12 and Table A2 likewise reveal that when the stacking model predicts Rd, the precision is highest in coastal areas, such as Shanghai and Guangzhou (GPI range 0.7123–0.8135), again with the exception of Sanya. Among the inland areas, Wenjiang performed best with a GPI of 0.8055, and Shenyang had the worst accuracy with a GPI of 0.1877. This is consistent with the findings of Jia et al. [70], who evaluated the performance of three common ML algorithms (linear modeling, SVR, and RF) in solar radiation estimation for eight cities in China and demonstrated that model performance in coastal areas was higher than that in inland areas. Wang et al. [71] obtained similar results when using 97 empirical models to predict the Rd of 17 cities in China: model accuracy was higher in coastal areas, but the rainy-weather model in Sanya performed poorly. Presumably, atmospheric conditions degrade model accuracy in some inland areas.

4. Conclusions

In this study, 17 typical radiation stations in China were selected, and meteorological and air pollution data were collected. The characteristic Rs and Rd factors were determined using the CatBoost feature selection algorithm, and daily Rs and Rd prediction models were constructed using the proposed stacking framework. SHAP values were used to explain the ensemble model and verify the reliability of the feature selection algorithm. The main conclusions of this study are as follows:
(1)
Among the meteorological factors, n and its related characteristics (Ra and N) have the greatest influence on the prediction of solar radiation (Rs and Rd), whereas among the air pollution factors, O3 has the greatest influence. The most important feature is n, and the higher its value, the greater its influence on the radiation prediction. Regarding the air pollution characteristics, a larger O3 value implies a greater effect on radiation prediction.
(2)
Compared with base learners, the proposed stacking model performs optimally with a mean improvement range of 5.70%–25.25% for RMSE, 5.31%–26.48% for MAE, and 1.12%–36.24% for R2, thus highlighting the necessity of ensemble learning model construction.
(3)
This study provides a reference for selecting predicted radiation input characteristics in different climatic regions in China. Notably, the accuracy of the proposed stacking model in coastal areas (Shanghai and Guangzhou) is better than that in inland regions.

Author Contributions

Y.D.: conceptualization, methodology, validation, writing—original draft. Y.W.: conceptualization, writing—original draft, investigation, validation, formal analysis. Z.L.: data curation, writing—review and editing, validation, formal analysis. L.Z.: data curation, software, funding acquisition. Y.S.: data curation, supervision. S.C.: data curation. X.X.: supervision, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 52309050 and 32372680), the Ph.D. Research Startup Foundation of Henan University of Science and Technology (No. 13480025 and 13480033), Key R&D and Promotion Projects in Henan Province (Science and Technology Development) (No. 232102110264), Henan Provincial Tobacco Company Luoyang City Company Technology Innovation Pro (No. 2023410300200043), Key Scientific Research Projects of Colleges and Universities in Henan Province (No. 24B416001), and the Innovative Research Team (Science and Technology) in the University of Henan Province (23IRTSTHN024).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

For this study, we are grateful to the National Meteorological Information Centre of the China Meteorological Administration for supplying the climate database.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

AQI: air quality index
CO: carbon monoxide
GPI: global performance index
MAE: mean absolute error
n: sunshine duration
N: maximum sunshine duration
NO2: nitrogen dioxide
O3: ozone
Pr: air pressure
Pt: precipitation
R2: coefficient of determination
Ra: extra-terrestrial solar radiation
Rd: diffuse solar radiation
Rh: relative humidity
RMSE: root mean square error
Rs: global solar radiation
SO2: sulfur dioxide
Tmax: maximum temperature
Tmean: mean temperature
Tmin: minimum temperature
Vpd: vapor pressure deficit
Ws: wind speed

Appendix A

(See Table A1 and Table A2).
Table A1. Results of CatBoost feature selection.
Station | Rs selected features | Rd selected features
Mohe | n, Ra, Vpd, N, O3, Tmax, Pt, Rh, Pr | N, Ra, SO2, n, CO, Pr, O3, Rh, NO2
Harbin | n, N, Ra, Vpd, Tmean, O3, Pt, Rh, Tmax | n, N, Ra, Tmin, Pt, Tmean, Tmax, CO, Pr
Urumqi | n, N, Ra, O3, Pt, Rh, NO2, Vpd, Tmean | n, N, Ra, O3, NO2, Rh, Ws, Pr, Tmin
Kashgar | n, Ra, N, O3, Tmean, Tmin, Rh, Vpd, Ws | n, N, Ra, PM10, AQI, PM2.5, Tmin, Ws, NO2
Ejin Banner | n, N, Ra, NO2, Rh, Tmean, Vpd, CO, SO2 | n, N, Ra, Ws, PM10, Rh, PM2.5, Pr, SO2
Yuzhong | n, N, O3, Ra, Tmax, Vpd, Tmin, Tmean, CO | n, N, Ra, Tmax, NO2, Tmin, Vpd, Pt, CO
Shenyang | n, O3, N, Ra, Pr, Tmin, PM2.5, Vpd, Tmax | Pr, N, Ra, n, O3, NO2, Tmean, Tmin, Rh
Beijing | n, N, O3, Vpd, Ra, Tmin, Pt, NO2, Rh | n, N, Ra, O3, PM2.5, Pt, Rh, Vpd, NO2
Lhasa | n, N, Ra, Tmax, PM10, O3, Tmean, Vpd, Rh | n, Tmax, PM10, Rh, Pr, Vpd, N, O3, Tmean
Wenjiang | n, O3, N, Tmax, Ra, Vpd, Tmin, SO2, PM10 | n, O3, N, Ra, Tmin, Tmax, Vpd, Ws, Pt
Kunming | n, Tmax, O3, Vpd, SO2, Tmin, N, PM2.5, Ra | n, N, Ra, Tmax, Vpd, Tmin, Rh, O3, Tmean
Zhengzhou | n, N, Ra, O3, Vpd, Tmax, Tmean, Pt, NO2 | n, N, Ra, O3, Pt, Tmax, Rh, Vpd, SO2
Wuhan | n, O3, N, Ra, Pt, Tmin, Tmax, Vpd, NO2 | n, N, Ra, O3, Pt, Vpd, CO, Rh, Tmax
Guiyang | n, N, Ra, O3, Vpd, Pr, Tmin, SO2, Tmean | n, Vpd, N, Ra, O3, Rh, Pr, Tmax, NO2
Shanghai | n, N, Ra, Vpd, Pt, O3, Tmax, Tmean, Pr | n, N, Ra, Vpd, Pt, O3, PM2.5, Pr, PM10
Guangzhou | n, N, O3, Ra, Vpd, Tmean, Tmax, Rh, Pt | n, N, Ra, Vpd, Pt, O3, PM2.5, Pr, PM10
Sanya | n, Tmax, N, Ra, Ws, Tmean, Vpd, Tmin, Pt | n, Ra, N, Tmax, Pt, Vpd, Ws, Pr, Tmean
Table A2. Results of model accuracy.
Station | Model | Rs RMSE | Rs MAE | Rs R2 | Rs GPI | Rs Rank | Rd RMSE | Rd MAE | Rd R2 | Rd GPI | Rd Rank
(RMSE and MAE in MJ m−2 d−1)
Mohe | Bayesian | 2.1918 | 1.7144 | 0.9071 | 0.5563 | 64 | 2.7046 | 1.9142 | 0.4392 | −0.1873 | 96
Mohe | KNN | 1.9807 | 1.3251 | 0.9241 | 0.7711 | 28 | 2.1523 | 1.4765 | 0.6449 | 0.3683 | 50
Mohe | ANN | 2.3784 | 1.7955 | 0.8906 | 0.4265 | 75 | 2.7138 | 1.9335 | 0.4354 | −0.1958 | 97
Mohe | SVR | 2.3133 | 1.7336 | 0.8965 | 0.4724 | 71 | 2.7881 | 1.8902 | 0.4040 | −0.2662 | 99
Mohe | XGBoost | 2.1291 | 1.4671 | 0.9123 | 0.6532 | 50 | 2.2428 | 1.5118 | 0.6144 | 0.3079 | 52
Mohe | Stacking | 1.8686 | 1.3225 | 0.9325 | 0.8029 | 20 | 2.0729 | 1.4494 | 0.6706 | 0.4179 | 39
Harbin | Bayesian | 2.0972 | 1.5900 | 0.9132 | 0.6140 | 55 | 1.9911 | 1.4315 | 0.6453 | 0.3905 | 45
Harbin | KNN | 2.0296 | 1.4477 | 0.9187 | 0.6868 | 43 | 1.7722 | 1.2932 | 0.7190 | 0.5618 | 28
Harbin | ANN | 2.6591 | 2.0056 | 0.8605 | 0.2122 | 87 | 2.1163 | 1.5929 | 0.5993 | 0.2476 | 57
Harbin | SVR | 2.1542 | 1.5628 | 0.9084 | 0.5886 | 60 | 1.9804 | 1.3846 | 0.6491 | 0.4184 | 38
Harbin | XGBoost | 1.9013 | 1.4095 | 0.9287 | 0.7435 | 33 | 1.7395 | 1.2146 | 0.7293 | 0.6141 | 19
Harbin | Stacking | 1.7134 | 1.2734 | 0.9421 | 0.8639 | 12 | 1.6492 | 1.1659 | 0.7567 | 0.6764 | 12
Urumqi | Bayesian | 2.7671 | 2.1300 | 0.9419 | 0.4683 | 72 | 3.3087 | 2.6906 | 0.8984 | 0.2571 | 56
Urumqi | KNN | 2.3679 | 1.7471 | 0.9575 | 0.6740 | 45 | 3.1970 | 2.0990 | 0.9051 | 0.4400 | 37
Urumqi | ANN | 3.8613 | 2.8224 | 0.8869 | −0.1551 | 92 | 4.1473 | 3.0654 | 0.8403 | −0.1159 | 92
Urumqi | SVR | 3.3488 | 2.3650 | 0.9149 | 0.1885 | 88 | 3.4619 | 2.7063 | 0.8887 | 0.1903 | 63
Urumqi | XGBoost | 2.3976 | 1.7975 | 0.9564 | 0.6590 | 49 | 3.1848 | 2.1865 | 0.9058 | 0.3991 | 43
Urumqi | Stacking | 2.1316 | 1.6042 | 0.9655 | 0.7917 | 25 | 2.9038 | 2.0177 | 0.9217 | 0.5027 | 32
Kashgar | Bayesian | 2.1529 | 1.5410 | 0.9218 | 0.6486 | 51 | 2.4672 | 1.9025 | 0.4783 | −0.0732 | 90
Kashgar | KNN | 1.9285 | 1.3431 | 0.9372 | 0.8094 | 19 | 2.1505 | 1.6264 | 0.6037 | 0.2387 | 60
Kashgar | ANN | 2.6621 | 1.8388 | 0.8804 | 0.3406 | 82 | 2.5345 | 1.9817 | 0.4495 | −0.1522 | 95
Kashgar | SVR | 2.2693 | 1.5594 | 0.9131 | 0.6072 | 56 | 2.4775 | 1.8891 | 0.4739 | −0.0730 | 89
Kashgar | XGBoost | 1.7351 | 1.2141 | 0.9492 | 0.9210 | 4 | 2.0379 | 1.5694 | 0.6441 | 0.3352 | 51
Kashgar | Stacking | 1.7402 | 1.1964 | 0.9489 | 0.9293 | 3 | 1.9745 | 1.4958 | 0.6659 | 0.3889 | 46
Ejin Banner | Bayesian | 1.8779 | 1.4681 | 0.9413 | 0.7982 | 22 | 1.5072 | 1.1420 | 0.5530 | 0.3980 | 44
Ejin Banner | KNN | 2.0167 | 1.5041 | 0.9323 | 0.7136 | 39 | 1.5122 | 1.1182 | 0.5500 | 0.4052 | 41
Ejin Banner | ANN | 2.4295 | 1.7217 | 0.9018 | 0.4803 | 70 | 1.5234 | 1.1510 | 0.5433 | 0.3800 | 48
Ejin Banner | SVR | 2.0163 | 1.4906 | 0.9323 | 0.7138 | 38 | 1.5298 | 1.1286 | 0.5394 | 0.3852 | 47
Ejin Banner | XGBoost | 1.8504 | 1.3298 | 0.9430 | 0.8375 | 13 | 1.4453 | 1.0650 | 0.5889 | 0.4861 | 34
Ejin Banner | Stacking | 1.7101 | 1.2810 | 0.9513 | 0.8974 | 9 | 1.3756 | 1.0100 | 0.6276 | 0.5676 | 27
Yuzhong | Bayesian | 1.9320 | 1.4760 | 0.9338 | 0.7507 | 30 | 2.3032 | 1.7705 | 0.4478 | −0.0532 | 88
Yuzhong | KNN | 1.8964 | 1.4146 | 0.9362 | 0.7728 | 27 | 1.9778 | 1.4778 | 0.5928 | 0.2936 | 54
Yuzhong | ANN | 1.8332 | 1.4079 | 0.9404 | 0.8116 | 18 | 1.9612 | 1.4817 | 0.5996 | 0.3014 | 53
Yuzhong | SVR | 1.9735 | 1.4485 | 0.9309 | 0.7309 | 34 | 2.3815 | 1.8120 | 0.4096 | −0.1275 | 93
Yuzhong | XGBoost | 1.7937 | 1.3454 | 0.9429 | 0.8356 | 14 | 1.9725 | 1.4856 | 0.5950 | 0.2929 | 55
Yuzhong | Stacking | 1.6804 | 1.2441 | 0.9499 | 0.9079 | 6 | 1.8468 | 1.3773 | 0.6449 | 0.4160 | 40
Shenyang | Bayesian | 4.1527 | 3.0711 | 0.7051 | −0.9483 | 100 | 2.3879 | 1.6394 | 0.4344 | −0.0844 | 91
Shenyang | KNN | 4.1885 | 2.8946 | 0.7000 | −0.9434 | 99 | 2.2269 | 1.5212 | 0.5081 | 0.1522 | 70
Shenyang | ANN | 4.1614 | 3.0661 | 0.7039 | −0.9502 | 101 | 2.3477 | 1.5913 | 0.4533 | −0.0029 | 85
Shenyang | SVR | 4.2514 | 3.0254 | 0.6909 | −1.0000 | 102 | 2.4509 | 1.6019 | 0.4042 | −0.1493 | 94
Shenyang | XGBoost | 3.9516 | 2.8782 | 0.7330 | −0.7450 | 97 | 2.3046 | 1.5955 | 0.4732 | 0.0665 | 81
Shenyang | Stacking | 3.7482 | 2.6814 | 0.7597 | −0.5614 | 96 | 2.1878 | 1.4982 | 0.5253 | 0.1877 | 64
Beijing | Bayesian | 2.3739 | 1.7793 | 0.9097 | 0.4979 | 67 | 2.3646 | 1.7941 | 0.6165 | 0.1828 | 65
Beijing | KNN | 2.1076 | 1.5725 | 0.9288 | 0.6669 | 46 | 1.7896 | 1.3066 | 0.7803 | 0.6426 | 15
Beijing | ANN | 2.2310 | 1.6586 | 0.9202 | 0.5896 | 59 | 1.8379 | 1.3631 | 0.7683 | 0.5984 | 23
Beijing | SVR | 2.5790 | 1.8408 | 0.8934 | 0.3871 | 78 | 2.4842 | 1.7709 | 0.5767 | 0.1301 | 73
Beijing | XGBoost | 1.9783 | 1.4928 | 0.9373 | 0.7461 | 32 | 1.7378 | 1.2575 | 0.7929 | 0.6840 | 11
Beijing | Stacking | 1.8961 | 1.4168 | 0.9424 | 0.7953 | 24 | 1.6405 | 1.1985 | 0.8154 | 0.7444 | 5
Lhasa | Bayesian | 1.8594 | 1.4395 | 0.8742 | 0.5609 | 63 | 2.9043 | 2.1891 | 0.7368 | 0.1671 | 69
Lhasa | KNN | 2.1159 | 1.5922 | 0.8372 | 0.3301 | 83 | 3.1569 | 2.4330 | 0.6890 | −0.0161 | 86
Lhasa | ANN | 2.7218 | 2.1076 | 0.7305 | −0.3469 | 95 | 3.6526 | 2.8312 | 0.5837 | −0.3688 | 101
Lhasa | SVR | 1.9900 | 1.4940 | 0.8560 | 0.4455 | 73 | 3.0334 | 2.2771 | 0.7129 | 0.0884 | 78
Lhasa | XGBoost | 1.8151 | 1.3435 | 0.8802 | 0.6014 | 57 | 3.0388 | 2.2588 | 0.7118 | 0.0883 | 79
Lhasa | Stacking | 1.7011 | 1.2535 | 0.8947 | 0.7020 | 41 | 2.7780 | 2.1024 | 0.7592 | 0.2428 | 58
Wenjiang | Bayesian | 1.9252 | 1.5010 | 0.9258 | 0.7241 | 35 | 1.6984 | 1.3169 | 0.6088 | 0.4025 | 42
Wenjiang | KNN | 1.8894 | 1.4664 | 0.9285 | 0.7474 | 31 | 1.4109 | 1.1182 | 0.7300 | 0.6746 | 13
Wenjiang | ANN | 1.8219 | 1.4432 | 0.9335 | 0.7909 | 26 | 1.3805 | 1.0739 | 0.7415 | 0.7014 | 9
Wenjiang | SVR | 1.9572 | 1.5264 | 0.9233 | 0.7031 | 40 | 1.7294 | 1.3526 | 0.5944 | 0.3713 | 49
Wenjiang | XGBoost | 1.6322 | 1.2413 | 0.9467 | 0.9095 | 5 | 1.2996 | 1.0189 | 0.7710 | 0.7714 | 4
Wenjiang | Stacking | 1.5830 | 1.2111 | 0.9498 | 0.9394 | 2 | 1.2589 | 0.9811 | 0.7851 | 0.8055 | 2
Kunming | Bayesian | 2.4756 | 1.9301 | 0.8515 | 0.2480 | 86 | 2.0656 | 1.6469 | 0.5633 | 0.2106 | 62
Kunming | KNN | 2.4103 | 1.7799 | 0.8592 | 0.3005 | 84 | 1.7261 | 1.3115 | 0.6951 | 0.5189 | 31
Kunming | ANN | 2.5556 | 1.9765 | 0.8418 | 0.1826 | 90 | 1.7430 | 1.3557 | 0.6891 | 0.5013 | 33
Kunming | SVR | 2.5512 | 1.9286 | 0.8423 | 0.1862 | 89 | 2.0952 | 1.6264 | 0.5507 | 0.1824 | 66
Kunming | XGBoost | 2.2803 | 1.7327 | 0.8740 | 0.4028 | 77 | 1.7173 | 1.3221 | 0.6982 | 0.5231 | 30
Kunming | Stacking | 2.1655 | 1.6358 | 0.8864 | 0.4908 | 68 | 1.6212 | 1.2413 | 0.7310 | 0.6037 | 20
Zhengzhou | Bayesian | 2.1074 | 1.5631 | 0.9149 | 0.6162 | 54 | 2.5686 | 1.9922 | 0.5672 | 0.0419 | 83
Zhengzhou | KNN | 2.1584 | 1.5784 | 0.9107 | 0.5885 | 61 | 1.8867 | 1.3968 | 0.7665 | 0.5797 | 26
Zhengzhou | ANN | 2.4454 | 1.8581 | 0.8854 | 0.3826 | 79 | 2.3237 | 1.8028 | 0.6458 | 0.2387 | 59
Zhengzhou | SVR | 2.1919 | 1.5861 | 0.9079 | 0.5743 | 62 | 2.6730 | 2.0114 | 0.5313 | −0.0500 | 87
Zhengzhou | XGBoost | 1.7064 | 1.2659 | 0.9442 | 0.8755 | 11 | 1.7201 | 1.2745 | 0.8059 | 0.6944 | 10
Zhengzhou | Stacking | 1.6830 | 1.2375 | 0.9457 | 0.8960 | 10 | 1.6387 | 1.2315 | 0.8238 | 0.7406 | 6
Wuhan | Bayesian | 2.1617 | 1.6598 | 0.9138 | 0.5922 | 58 | 2.4688 | 1.9253 | 0.5975 | 0.1197 | 75
Wuhan | KNN | 2.0503 | 1.5531 | 0.9225 | 0.6653 | 47 | 2.0680 | 1.5262 | 0.7176 | 0.4480 | 36
Wuhan | ANN | 2.1136 | 1.5833 | 0.9176 | 0.6240 | 53 | 2.0640 | 1.5253 | 0.7187 | 0.4500 | 35
Wuhan | SVR | 2.2333 | 1.6909 | 0.9080 | 0.5443 | 65 | 2.5581 | 1.9822 | 0.5679 | 0.0467 | 82
Wuhan | XGBoost | 1.9836 | 1.4564 | 0.9274 | 0.7139 | 37 | 1.9261 | 1.4523 | 0.7550 | 0.5367 | 29
Wuhan | Stacking | 1.8406 | 1.3651 | 0.9375 | 0.7989 | 21 | 1.8608 | 1.3714 | 0.7714 | 0.5988 | 22
Guiyang | Bayesian | 2.8445 | 2.1639 | 0.8303 | 0.0200 | 91 | 1.8099 | 1.3897 | 0.4760 | 0.1750 | 68
Guiyang | KNN | 2.5333 | 1.9454 | 0.8654 | 0.2771 | 85 | 1.4346 | 1.1220 | 0.6708 | 0.5820 | 25
Guiyang | ANN | 3.7968 | 2.8784 | 0.6977 | −0.8734 | 98 | 1.8045 | 1.4026 | 0.4791 | 0.1812 | 67
Guiyang | SVR | 3.1248 | 2.2558 | 0.7953 | −0.1992 | 94 | 1.8442 | 1.3974 | 0.4560 | 0.1374 | 71
Guiyang | XGBoost | 2.3494 | 1.7264 | 0.8843 | 0.4144 | 76 | 1.3859 | 1.0940 | 0.6928 | 0.6302 | 16
Guiyang | Stacking | 2.3101 | 1.7507 | 0.8881 | 0.4431 | 74 | 1.3464 | 1.0684 | 0.7100 | 0.6684 | 14
Shanghai | Bayesian | 2.0299 | 1.5825 | 0.9258 | 0.6848 | 44 | 2.2584 | 1.8567 | 0.5533 | 0.1297 | 74
Shanghai | KNN | 1.8070 | 1.3368 | 0.9412 | 0.8271 | 15 | 1.6760 | 1.2782 | 0.7540 | 0.6188 | 17
Shanghai | ANN | 1.8217 | 1.3967 | 0.9402 | 0.8152 | 17 | 1.5510 | 1.2025 | 0.7893 | 0.7105 | 8
Shanghai | SVR | 2.0611 | 1.5891 | 0.9235 | 0.6648 | 48 | 2.2541 | 1.8447 | 0.5550 | 0.1336 | 72
Shanghai | XGBoost | 1.7172 | 1.2426 | 0.9469 | 0.8976 | 8 | 1.4645 | 1.1032 | 0.8122 | 0.7855 | 3
Shanghai | Stacking | 1.6160 | 1.1773 | 0.9530 | 0.9542 | 1 | 1.4217 | 1.0770 | 0.8230 | 0.8135 | 1
Guangzhou | Bayesian | 1.8411 | 1.5085 | 0.9145 | 0.7142 | 36 | 2.1980 | 1.7564 | 0.5240 | 0.1089 | 76
Guangzhou | KNN | 1.7749 | 1.3943 | 0.9205 | 0.7609 | 29 | 1.6496 | 1.2628 | 0.7319 | 0.5947 | 24
Guangzhou | ANN | 1.7249 | 1.4105 | 0.9249 | 0.7957 | 23 | 1.6422 | 1.2846 | 0.7343 | 0.6006 | 21
Guangzhou | SVR | 1.8690 | 1.5391 | 0.9119 | 0.6943 | 42 | 2.2258 | 1.7745 | 0.5119 | 0.0820 | 80
Guangzhou | XGBoost | 1.6858 | 1.3248 | 0.9283 | 0.8226 | 16 | 1.6252 | 1.2496 | 0.7398 | 0.6142 | 18
Guangzhou | Stacking | 1.5737 | 1.2529 | 0.9375 | 0.8980 | 7 | 1.5092 | 1.1474 | 0.7756 | 0.7123 | 7
Sanya | Bayesian | 2.0116 | 1.5810 | 0.8841 | 0.5400 | 66 | 2.2276 | 1.7637 | 0.3020 | −0.2575 | 98
Sanya | KNN | 2.2169 | 1.7200 | 0.8592 | 0.3728 | 80 | 2.0277 | 1.6016 | 0.4216 | 0.0126 | 84
Sanya | ANN | 2.7130 | 2.1768 | 0.7892 | −0.1700 | 93 | 2.2544 | 1.8058 | 0.2851 | −0.3018 | 100
Sanya | SVR | 2.2331 | 1.7238 | 0.8572 | 0.3592 | 81 | 2.3561 | 1.8536 | 0.2191 | −0.4186 | 102
Sanya | XGBoost | 2.0795 | 1.6019 | 0.8761 | 0.4856 | 69 | 1.9681 | 1.5654 | 0.4551 | 0.0904 | 77
Sanya | Stacking | 1.8783 | 1.4672 | 0.8989 | 0.6438 | 52 | 1.8562 | 1.5052 | 0.5153 | 0.2148 | 61
Figure A1. The mean absolute SHAP value of the stacking model in estimating Rs.
Figure A2. The mean absolute SHAP value of the stacking model in estimating Rd.
Figure A3. The input feature SHAP value of stacking when estimating Rs.
Figure A4. The input feature SHAP value of stacking when estimating Rd.
Figure A5. Scatter density plot of different models in estimating Rs.
Figure A6. Scatter density plot of different models in estimating Rd.
Figure A7. Taylor plots of different models when estimating Rs.
Figure A8. Taylor plots of different models when estimating Rd.

References

  1. Acikgoz, H. A novel approach based on integration of convolutional neural networks and deep feature selection for short-term solar radiation forecasting. Appl. Energy 2022, 305, 117912. [Google Scholar] [CrossRef]
  2. Sohrabi Geshnigani, F.; Golabi, M.R.; Mirabbasi, R.; Tahroudi, M.N. Daily solar radiation estimation in Belleville station, Illinois, using ensemble artificial intelligence approaches. Eng. Appl. Artif. Intell. 2023, 120, 105839. [Google Scholar] [CrossRef]
  3. Ajith, M.; Martínez-Ramón, M. Deep learning based solar radiation micro forecast by fusion of infrared cloud images and radiation data. Appl. Energy 2021, 294, 117014. [Google Scholar] [CrossRef]
  4. Mayer, M.J. Benefits of physical and machine learning hybridization for photovoltaic power forecasting. Renew. Sustain. Energy Rev. 2022, 168, 112772. [Google Scholar] [CrossRef]
  5. Amiri, B.; Gómez-Orellana, A.M.; Gutiérrez, P.A.; Dizène, R.; Hervás-Martínez, C.; Dahmani, K. A novel approach for global solar irradiation forecasting on tilted plane using Hybrid Evolutionary Neural Networks. J. Clean. Prod. 2021, 287, 125577. [Google Scholar] [CrossRef]
  6. Shao, C.; Yang, K.; Tang, W.; He, Y.; Jiang, Y.; Lu, H.; Fu, H.; Zheng, J. Convolutional neural network-based homogenization for constructing a long-term global surface solar radiation dataset. Renew. Sustain. Energy Rev. 2022, 169, 112952. [Google Scholar] [CrossRef]
  7. Yang, D.; Gueymard, C.A. Ensemble model output statistics for the separation of direct and diffuse components from 1-min global irradiance. Sol. Energy 2020, 208, 591–603. [Google Scholar] [CrossRef]
  8. Feng, Y.; Cui, N.; Zhang, Q.; Zhao, L.; Gong, D. Comparison of artificial intelligence and empirical models for estimation of daily diffuse solar radiation in North China Plain. Int. J. Hydrogen Energy 2017, 42, 14418–14428. [Google Scholar] [CrossRef]
  9. Yagli, G.M.; Yang, D.; Gandhi, O.; Srinivasan, D. Can we justify producing univariate machine-learning forecasts with satellite-derived solar irradiance? Appl. Energy 2020, 259, 114122. [Google Scholar] [CrossRef]
  10. Zhou, Y.; Li, Y.; Wang, D.; Liu, Y. A multi-step ahead global solar radiation prediction method using an attention-based transformer model with an interpretable mechanism. Int. J. Hydrogen Energy 2023, 48, 15317–15330. [Google Scholar] [CrossRef]
  11. Lu, Y.; Wang, L.; Zhu, C.; Zou, L.; Zhang, M.; Feng, L.; Cao, Q. Predicting surface solar radiation using a hybrid radiative Transfer–Machine learning model. Renew. Sustain. Energy Rev. 2023, 173, 113105. [Google Scholar] [CrossRef]
  12. Gao, Y.; Li, P.; Yang, H.; Wang, J. A solar radiation intelligent forecasting framework based on feature selection and multivariable fuzzy time series. Eng. Appl. Artif. Intell. 2023, 126, 106986. [Google Scholar] [CrossRef]
  13. Xue, X. Prediction of daily diffuse solar radiation using artificial neural networks. Int. J. Hydrogen Energy 2017, 42, 28214–28221. [Google Scholar] [CrossRef]
  14. Yang, D. Correlogram, predictability error growth, and bounds of mean square error of solar irradiance forecasts. Renew. Sustain. Energy Rev. 2022, 167, 112736. [Google Scholar] [CrossRef]
  15. Huang, C.; Shi, H.; Yang, D.; Gao, L.; Zhang, P.; Fu, D.; Chen, Q.; Yuan, Y.; Liu, M.; Hu, B.; et al. Retrieval of sub-kilometer resolution solar irradiance from Fengyun-4A satellite using a region-adapted Heliosat-2 method. Sol. Energy 2023, 264, 112038. [Google Scholar] [CrossRef]
16. Ghimire, S.; Deo, R.C.; Casillas-Pérez, D.; Salcedo-Sanz, S. Boosting solar radiation predictions with global climate models, observational predictors and hybrid deep-machine learning algorithms. Appl. Energy 2022, 316, 119063.
17. Bailek, N.; Bouchouicha, K.; Al-Mostafa, Z.; El-Shimy, M.; Aoun, N.; Slimani, A.; Al-Shehri, S. A new empirical model for forecasting the diffuse solar radiation over Sahara in the Algerian Big South. Renew. Energy 2018, 117, 530–537.
18. De Souza, J.L.; Lyra, G.B.; Dos Santos, C.M.; Ferreira, R.A., Jr.; Tiba, C.; Lyra, G.B.; Lemes, M.A.M. Empirical models of daily and monthly global solar irradiation using sunshine duration for Alagoas State, Northeastern Brazil. Sustain. Energy Technol. Assess. 2016, 14, 35–45.
19. Uçkan, İ.; Khudhur, K.M. Improving of global solar radiation forecast by comparing other meteorological parameter models with sunshine duration models. Environ. Sci. Pollut. Res. 2022, 29, 37867–37881.
20. Alizamir, M.; Shiri, J.; Fard, A.F.; Kim, S.; Gorgij, A.D.; Heddam, S.; Singh, V.P. Improving the accuracy of daily solar radiation prediction by climatic data using an efficient hybrid deep learning model: Long short-term memory (LSTM) network coupled with wavelet transform. Eng. Appl. Artif. Intell. 2023, 123, 106199.
21. Yang, D. Reconciling solar forecasts: Probabilistic forecast reconciliation in a nonparametric framework. Sol. Energy 2020, 210, 49–58.
22. Hassan, M.A.; Khalil, A.; Kaseb, S.; Kassem, M.A. Potential of four different machine-learning algorithms in modeling daily global solar radiation. Renew. Energy 2017, 111, 52–62.
23. Feng, Y.; Gong, D.; Zhang, Q.; Jiang, S.; Zhao, L.; Cui, N. Evaluation of temperature-based machine learning and empirical models for predicting daily global solar radiation. Energy Convers. Manag. 2019, 198, 111780.
24. Zhao, S.; Wu, L.; Xiang, Y.; Dong, J.; Li, Z.; Liu, X.; Tang, Z.; Wang, H.; Wang, X.; An, J.; et al. Coupling meteorological stations data and satellite data for prediction of global solar radiation with machine learning models. Renew. Energy 2022, 198, 1049–1064.
25. Dong, J.; Wu, L.; Liu, X.; Fan, C.; Leng, M.; Yang, Q. Simulation of Daily Diffuse Solar Radiation Based on Three Machine Learning Models. Comput. Model. Eng. Sci. 2020, 123, 49–73.
26. Lee, J.; Wang, W.; Harrou, F.; Sun, Y. Reliable solar irradiance prediction using ensemble learning-based models: A comparative study. Energy Convers. Manag. 2020, 208, 112582.
27. Ganaie, M.A.; Hu, M.H.; Malik, A.K.; Tanveer, M.; Suganthan, P.N. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151.
28. Yagli, G.M.; Yang, D.; Srinivasan, D. Ensemble solar forecasting using data-driven models with probabilistic post-processing through GAMLSS. Sol. Energy 2020, 208, 612–622.
29. Al-Hajj, R.; Assi, A.; Fouad, M. Short-Term Prediction of Global Solar Radiation Energy Using Weather Data and Machine Learning Ensembles: A Comparative Study. J. Sol. Energy Eng. 2021, 143, 051003.
30. Zhou, S.; Wang, Y.; Yuan, Q.; Yue, L.; Zhang, L. Spatiotemporal estimation of 6-hour high-resolution precipitation across China based on Himawari-8 using a stacking ensemble machine learning model. J. Hydrol. 2022, 609, 127718.
31. Fan, J.; Wu, L.; Zhang, F.; Cai, H.; Zeng, W.; Wang, X.; Zou, H. Empirical and machine learning models for predicting daily global solar radiation from sunshine duration: A review and case study in China. Renew. Sustain. Energy Rev. 2019, 100, 186–212.
32. Fan, J.; Wang, X.; Zhang, F.; Ma, X.; Wu, L. Predicting daily diffuse horizontal solar radiation in various climatic regions of China using support vector machine and tree-based soft computing models with local and extrinsic climatic data. J. Clean. Prod. 2020, 248, 119264.
33. Abreu, E.F.M.; Gueymard, C.A.; Canhoto, P.; Costa, M.J. Performance assessment of clear-sky solar irradiance predictions using state-of-the-art radiation models and input atmospheric data from reanalysis or ground measurements. Sol. Energy 2023, 252, 309–321.
34. Buster, G.; Bannister, M.; Habte, A.; Hettinger, D.; Maclaurin, G.; Rossol, M.; Sengupta, M.; Xie, Y. Physics-guided machine learning for improved accuracy of the National Solar Radiation Database. Sol. Energy 2022, 232, 483–492.
35. Liu, Y.; Zhou, Y.; Chen, Y.; Wang, D.; Wang, Y.; Zhu, Y. Comparison of support vector machine and copula-based nonlinear quantile regression for estimating the daily diffuse solar radiation: A case study in China. Renew. Energy 2020, 146, 1101–1112.
36. Sun, H.; Gui, D.; Yan, B.; Liu, Y.; Liao, W.; Zhu, Y.; Lu, C.; Zhao, N. Assessing the potential of random forest method for estimating solar radiation using air pollution index. Energy Convers. Manag. 2016, 119, 121–129.
37. Fan, Y.; Chen, B.; Huang, W.; Liu, J.; Weng, W.; Lan, W. Multi-label feature selection based on label correlations and feature redundancy. Knowl.-Based Syst. 2022, 241, 108256.
38. Liu, X.; Tang, H.; Ding, Y.; Yan, D. Investigating the performance of machine learning models combined with different feature selection methods to estimate the energy consumption of buildings. Energy Build. 2022, 273, 112408.
39. Luo, M.; Wang, Y.; Xie, Y.; Zhou, L.; Qiao, J.; Qiu, S.; Sun, Y. Combination of Feature Selection and CatBoost for Prediction: The First Application to the Estimation of Aboveground Biomass. Forests 2021, 12, 216.
40. Mitrentsis, G.; Lens, H. An interpretable probabilistic model for short-term solar power forecasting using natural gradient boosting. Appl. Energy 2022, 309, 118473.
41. Bas, J.; Zou, Z.; Cirillo, C. An interpretable machine learning approach to understanding the impacts of attitudinal and ridesourcing factors on electric vehicle adoption. Transp. Lett. 2022, 15, 30–41.
42. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874.
43. Ding, S.; Huang, W.; Xu, W.; Wu, Y.; Zhao, Y.; Fang, P.; Hu, B.; Lou, L. Improving kitchen waste composting maturity by optimizing the processing parameters based on machine learning model. Bioresour. Technol. 2022, 360, 127606.
44. Allen, R.G.; Pereira, L.S.; Raes, D.; Smith, M. Crop Evapotranspiration: Guidelines for Computing Crop Water Requirements; FAO Irrigation and Drainage Paper 56; FAO: Rome, Italy, 1998; 300, D05109.
45. Fan, J.; Wu, L.; Zhang, F.; Cai, H.; Wang, X.; Lu, X.; Xiang, Y. Evaluating the effect of air pollution on global and diffuse solar radiation prediction using support vector machine modeling based on sunshine duration and air temperature. Renew. Sustain. Energy Rev. 2018, 94, 732–747.
46. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31.
47. Li, L.; Qiao, J.; Yu, G.; Wang, L.; Li, H.Y.; Liao, C.; Zhu, Z. Interpretable tree-based ensemble model for predicting beach water quality. Water Res. 2022, 211, 118078.
48. Zhou, H.; Deng, Z.; Xia, Y.; Fu, M. A new sampling method in particle filter based on Pearson correlation coefficient. Neurocomputing 2016, 216, 208–215.
49. Ghimire, S.; Bhandari, B.; Casillas-Pérez, D.; Deo, R.C.; Salcedo-Sanz, S. Hybrid deep CNN-SVR algorithm for solar radiation prediction problems in Queensland, Australia. Eng. Appl. Artif. Intell. 2022, 112, 104860.
50. Talib, A.; Park, S.; Im, P.; Joe, J. Grey-box and ANN-based building models for multistep-ahead prediction of indoor temperature to implement model predictive control. Eng. Appl. Artif. Intell. 2023, 126, 107115.
51. Markovics, D.; Mayer, M.J. Comparison of machine learning methods for photovoltaic power forecasting based on numerical weather prediction. Renew. Sustain. Energy Rev. 2022, 161, 112364.
52. Yao, X. Evolving artificial neural networks. Proc. IEEE 1999, 87, 1423–1447.
53. Nguyen, B.; Morell, C.; De Baets, B. Large-scale distance metric learning for k-nearest neighbors regression. Neurocomputing 2016, 214, 805–814.
54. Saqib, M. Forecasting COVID-19 outbreak progression using hybrid polynomial-Bayesian ridge regression model. Appl. Intell. 2021, 51, 2703–2713.
55. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA; pp. 785–794.
56. Lv, M.; Li, H. Nonlinear Chirp Component Decomposition: A Method Based on Elastic Network Regression. IEEE Trans. Instrum. Meas. 2021, 70, 3515813.
57. Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
58. Yang, D.; van der Meer, D. Post-processing in solar forecasting: Ten overarching thinking tools. Renew. Sustain. Energy Rev. 2021, 140, 110735.
59. Kadingdi, F.; Ayawah, P.; Azure, J.; Bruno, K.; Kaba, A.; Frimpong, S. Stacked Generalization for Improved Prediction of Ground Vibration from Blasting in Open-Pit Mine Operations. Min. Metall. Explor. 2022, 39, 2351–2363.
60. Yang, D. The future of solar forecasting in China. J. Renew. Sustain. Energy 2023, 15, 052301.
61. Qiu, R.; Liu, C.; Cui, N.; Gao, Y.; Li, L.; Wu, Z.; Jiang, S.; Hu, M. Generalized Extreme Gradient Boosting model for predicting daily global solar radiation for locations without historical data. Energy Convers. Manag. 2022, 258, 115488.
62. He, C.; Liu, J.; Xu, F.; Zhang, T.; Chen, S.; Sun, Z.; Zheng, W.; Wang, R.; He, L.; Feng, H.; et al. Improving solar radiation estimation in China based on regional optimal combination of meteorological factors with machine learning methods. Energy Convers. Manag. 2020, 220, 113111.
63. Mohammadi, K.; Shamshirband, S.; Tong, C.W.; Alam, K.A.; Petković, D. Potential of adaptive neuro-fuzzy system for prediction of daily global solar radiation by day of the year. Energy Convers. Manag. 2015, 93, 406–413.
64. Fan, J.; Wu, L.; Ma, X.; Zhou, H.; Zhang, F. Hybrid support vector machines with heuristic algorithms for prediction of daily diffuse solar radiation in air-polluted regions. Renew. Energy 2020, 145, 2034–2045.
65. Labani, M.; Moradi, P.; Ahmadizar, F.; Jalili, M. A novel multivariate filter method for feature selection in text classification problems. Eng. Appl. Artif. Intell. 2018, 70, 25–37.
66. Liu, D.L.; Scott, B.J. Estimation of solar radiation in Australia from rainfall and temperature observations. Agric. For. Meteorol. 2001, 106, 41–59.
67. Fan, J.; Wang, X.; Wu, L.; Zhou, H.; Zhang, F.; Yu, X.; Lu, X.; Xiang, Y. Comparison of Support Vector Machine and Extreme Gradient Boosting for predicting daily global solar radiation using temperature and precipitation in humid subtropical climates: A case study in China. Energy Convers. Manag. 2018, 164, 102–111.
68. Ma, J.; Yu, Z.; Qu, Y.; Xu, J.; Cao, Y. Application of the XGBoost Machine Learning Method in PM2.5 Prediction: A Case Study of Shanghai. Aerosol Air Qual. Res. 2020, 20, 128–138.
69. Patel, S.K.; Surve, J.; Katkar, V.; Parmar, J.; Al-Zahrani, F.A.; Ahmed, K.; Bui, F.M. Encoding and Tuning of THz Metasurface-Based Refractive Index Sensor with Behavior Prediction Using XGBoost Regressor. IEEE Access 2022, 10, 24797–24814.
70. Jia, D.; Yang, L.; Lv, T.; Liu, W.; Gao, X.; Zhou, J. Evaluation of machine learning models for predicting daily global and diffuse solar radiation under different weather/pollution conditions. Renew. Energy 2022, 187, 896–906.
71. Wang, L.; Lu, Y.; Zou, L.; Feng, L.; Wei, J.; Qin, W.; Niu, Z. Prediction of diffuse solar radiation based on multiple variables in China. Renew. Sustain. Energy Rev. 2019, 103, 151–216.
Figure 1. Distribution map of the solar radiation stations.
Figure 2. Correlation analysis of the prediction errors of the individual ML models.
Figure 3. Schematic of the stacking model.
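As a companion to the schematic in Figure 3, the following minimal Python sketch illustrates stacked generalization [57] with scikit-learn. The base learners (k-nearest neighbors, Bayesian ridge, and gradient boosting), the elastic-net meta-learner, and the synthetic data are illustrative assumptions, not the exact configuration used in this study.

```python
# Minimal sketch of a stacking ensemble for daily solar radiation (Rs) prediction.
# The base learners and meta-learner below are illustrative only; they approximate,
# but do not reproduce, the configuration reported in the paper.
import numpy as np
from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(42)
# Synthetic stand-in for station data; in practice the columns would be
# meteorological inputs such as sunshine duration (n), temperature, and O3.
X = rng.random((1000, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.1, 1000)  # toy target (Rs)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

base_learners = [
    ("knn", KNeighborsRegressor(n_neighbors=10)),
    ("bayes_ridge", BayesianRidge()),
    ("gbrt", GradientBoostingRegressor(random_state=0)),
]
# Out-of-fold predictions of the base learners (cv=5) feed the meta-learner,
# following Wolpert's stacked-generalization scheme.
stack = StackingRegressor(estimators=base_learners, final_estimator=ElasticNet(), cv=5)
stack.fit(X_train, y_train)

pred = stack.predict(X_test)
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("MAE:", mean_absolute_error(y_test, pred))
print("R2:", r2_score(y_test, pred))
```

Using out-of-fold predictions rather than in-sample fits keeps the meta-learner from simply memorizing the base learners' training errors, which is the main reason stacking generalizes better than its individual members.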
Figure 4. Feature importance results of the CatBoost feature selection algorithm: (a) Rs; (b) Rd.
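For context on how rankings like those in Figure 4 can be produced, the sketch below scores features with CatBoost's built-in importance measure. The feature names, the synthetic data, and the selection cutoff are hypothetical illustrations, not the values used in this study.

```python
# Minimal sketch of CatBoost-based feature importance for feature selection,
# assuming the catboost package. Feature names and the cutoff are hypothetical.
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
feature_names = ["n", "Tmax", "Tmin", "RH", "O3", "Pre"]  # illustrative inputs
X = rng.random((500, len(feature_names)))
y = 2.0 * X[:, 0] + 0.5 * X[:, 4] + rng.normal(0, 0.05, 500)

model = CatBoostRegressor(iterations=300, depth=6, verbose=False, random_seed=0)
model.fit(X, y)

# The default importance type (PredictionValuesChange) ranks features by their
# average effect on the model output; low-ranked features can then be dropped.
importances = model.get_feature_importance()
ranked = sorted(zip(feature_names, importances), key=lambda t: -t[1])
for name, score in ranked:
    print(f"{name}: {score:.2f}")
selected = [name for name, score in ranked if score >= 5.0]  # hypothetical cutoff
print("Selected features:", selected)
```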
Figure 5. Mean absolute SHAP values of the stacking model in Beijing: (a) Rs; (b) Rd.
Figure 6. SHAP values of the input features in Beijing: (a) Rs; (b) Rd. Note: Red indicates high feature values, and blue indicates low feature values. A SHAP value greater than 0 indicates that the feature has a positive impact on radiation, and a SHAP value less than 0 indicates that it has a negative impact.
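SHAP summaries in the spirit of Figures 5 and 6 can be computed with the shap package [42], as sketched below. For simplicity, a single gradient-boosting model stands in for the stacking model; explaining a full stacked ensemble would require a model-agnostic explainer such as shap.KernelExplainer. The feature names and data are illustrative assumptions.

```python
# Minimal sketch of mean-absolute SHAP values for a tree model, assuming the
# shap package; a stacked model would need a model-agnostic explainer instead.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
feature_names = ["n", "Tmax", "O3", "RH"]  # illustrative inputs
X = rng.random((300, len(feature_names)))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(0, 0.05, 300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Positive SHAP values push the prediction above the baseline (positive impact
# on radiation); negative values push it below, as in the note to Figure 6.
mean_abs = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: mean |SHAP| = {value:.3f}")
```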
Figure 7. Daily Rs evaluation metrics for the 17 stations predicted by the various ML models during the testing phase.
Figure 8. Daily Rd evaluation metrics for the 17 stations predicted by the various ML models during the testing phase.
Figure 9. Rs scatter density diagrams for the different models at the Beijing station.
Figure 10. Rd scatter density diagrams for the different models at the Beijing station.
Figure 11. Box plots of the evaluation metrics of the various ML models in the Rs testing phase.
Figure 12. Box plots of the evaluation metrics of the various ML models in the Rd testing phase.
Figure 13. Taylor diagram of the models applied to predict Rs in Beijing.
Figure 14. Taylor diagram of the models applied to predict Rd in Beijing.
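The Taylor diagrams in Figures 13 and 14 condense three statistics per model: the standard deviations of the observations and predictions, their correlation coefficient, and the centered RMSE. The sketch below computes these for a synthetic observation/prediction pair; the data are placeholders.

```python
# Minimal sketch of the statistics a Taylor diagram summarizes; the observed
# and predicted series are synthetic placeholders, not the study's data.
import numpy as np

rng = np.random.default_rng(2)
obs = rng.random(365) * 30          # e.g., observed daily Rs (MJ m-2 d-1)
pred = obs + rng.normal(0, 2, 365)  # e.g., one model's predictions

sigma_obs = obs.std()
sigma_pred = pred.std()
corr = np.corrcoef(obs, pred)[0, 1]
# Centered RMSE: RMSE after removing the mean bias of the prediction.
crmse = np.sqrt(np.mean(((pred - pred.mean()) - (obs - obs.mean())) ** 2))
# Law-of-cosines identity underlying the Taylor diagram geometry:
# crmse**2 == sigma_obs**2 + sigma_pred**2 - 2 * sigma_obs * sigma_pred * corr
print(f"sigma_obs={sigma_obs:.2f}, sigma_pred={sigma_pred:.2f}, "
      f"r={corr:.3f}, centered RMSE={crmse:.2f}")
```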
Table 1. Related information on the different radiation stations.

| ID | Station | Latitude (°N) | Longitude (°E) | Altitude (m) | Climatic Zone | Köppen–Geiger Climate |
| --- | --- | --- | --- | --- | --- | --- |
| 50136 | Mohe | 52.58 | 122.31 | 438.5 | TMZ | Dw |
| 50953 | Harbin | 45.56 | 126.34 | 118.3 | TMZ | Dw |
| 51463 | Urumqi | 43.47 | 87.39 | 1930 | TCZ | Bs |
| 51709 | Kashgar | 39.29 | 75.45 | 1385.6 | TCZ | Bw |
| 52267 | Ejin Banner | 41.57 | 101.04 | 940.5 | TCZ | Bw |
| 52983 | Yuzhong | 35.52 | 104.09 | 1874.1 | TMZ | Dw |
| 54342 | Shenyang | 41.44 | 123.31 | 49.0 | TMZ | Dw |
| 54511 | Beijing | 39.48 | 116.28 | 45.8 | TMZ | Bs |
| 55591 | Lhasa | 29.40 | 91.08 | 8658 | MPZ | Bs |
| 56187 | Wenjiang | 30.45 | 103.52 | 548.9 | SMZ | Cf |
| 56778 | Kunming | 25.00 | 102.39 | 1888.1 | SMZ | Cf |
| 57083 | Zhengzhou | 34.43 | 113.39 | 110.4 | TMZ | Dw |
| 57494 | Wuhan | 30.36 | 114.03 | 23.6 | SMZ | Cf |
| 57816 | Guiyang | 26.35 | 106.44 | 1223.8 | SMZ | Cf |
| 58362 | Shanghai | 31.24 | 121.27 | 2.8 | SMZ | Cf |
| 59287 | Guangzhou | 23.13 | 113.29 | 70.7 | TPMZ | Cf |
| 59948 | Sanya | 18.13 | 109.35 | 5.0 | TPMZ | Aw |