Impact on Predictive Performance of Air Pollutants in PV Forecasting Using Multi-Model Ensemble Learning: Evidence from the Port Logistics Hinterland Area

Jungmin Ahn; Juyong Lee

doi:10.3390/systems13110943


Department of Industrial and Systems Engineering, College of Engineering, Changwon National University, Changwondaehak-ro 20, Changwon-si 51140, Republic of Korea

* Author to whom correspondence should be addressed.

Systems 2025, 13(11), 943; https://doi.org/10.3390/systems13110943

This article belongs to the Section Artificial Intelligence and Digital Systems Engineering


Abstract

The uncertainty of photovoltaic (PV) power generation can impact the stability and flexibility of the power grid. Thus, accurately forecasting PV power output is crucial for ensuring a stable power system and supporting next-generation policy decisions. The purpose of this study is to examine how PV power generation forecasting models perform with and without particulate matter (PM) and greenhouse gas (GHG) concentration factors added to the meteorological data. PV power generation is forecasted with several machine learning models. The results indicate that there was no significant difference in forecasting accuracy whether PM and GHG variables were included or not. In addition, the stacked ensemble model has the lowest root mean square error (RMSE) and mean absolute error (MAE) values for all datasets and shows improved performance compared to the single models. The stacked ensemble that includes a combination of meteorological, PM, and GHG variables performs the best. However, the optimal datasets varied across models. Therefore, this study concluded that meteorological variables had the greatest influence on the PV generation forecasting performance. Among the additional factors, PM contributed more to the improvement in forecasting performance than GHG.

Keywords:

photovoltaic power; port hinterlands; particulate matter; greenhouse gas; stacked ensemble model

1. Introduction

Achieving global carbon neutrality requires a clean renewable energy transition [1,2]. Many countries have been striving to reduce greenhouse gas (GHG) emissions and increase the share of renewable energy in their energy mixes. In Korea, the government released the 2050 Carbon Neutrality Strategy in October 2021 to achieve the goal of net zero domestic carbon emissions by 2050. The strategy includes improving electricity demand management, permanently shutting down coal-fired power generation, utilising additional sources of natural gas power generation, and expanding renewable energy generation to 70.7% of the Korean energy mix [3]. The Korean government has implemented a photovoltaic-centred renewable energy development policy, with a share of photovoltaic power generation that is higher than that of other OECD countries [4]. Because photovoltaic power generation is easily scalable, cost-effective, and free from resource depletion, the number of photovoltaic power facilities is increasing rapidly worldwide [5]. In South Korea, it is predicted that by 2027, total renewable energy generation capacity will double from 2022, with photovoltaic power accounting for 85% [6]. Thus, photovoltaic power plays an essential role in the next-generation energy policy in Korea, and reliable power generation with low intermittency is important for effective policy implementation.

However, the output of grid-connected photovoltaic power systems is inherently intermittent and fluctuates depending on insolation, weather, and other environmental factors [7]. Generally, the inclusion of large-scale and volatile renewable energy sources decreases the flexibility in the power system, and Korea’s high share of photovoltaic power generation can increase the volatility of the electricity supply and demand. This leads to difficulties in managing the supply reserve margin and grid operations and complexity in electric power market bidding [8,9].

Furthermore, among the environmental factors, high concentrations of particulate matter (PM) can absorb or scatter sunlight before it reaches the solar panels, reducing the solar radiation and power generation [10]. South Korea’s PM₂.₅ concentrations have continued to decline since 2019, but they still remain high compared to major OECD countries [11]. The efficiency of photovoltaic power generation decreases with increasing PM concentration. The PM-induced decrease in solar PV power generation reaches 22.6% and 22.0% under ‘bad’ air quality conditions, as defined in Korea by the concentrations of PM₂.₅ (75 μg m⁻³) and PM₁₀ [12]. In addition, although previous studies have shown that GHG (NO₂, O₃, CO, etc.) have a negligible impact on electricity generation due to their very small atmospheric concentrations, the International Energy Agency (IEA) reports that global GHG emissions per capita continued to rise through 2022 [13,14]. Since climate change is influenced by the concentration of GHG in the atmosphere, this can also affect the efficiency of photovoltaic power generation, which is dependent on climatic variables [15]. Therefore, in this study, we considered PM (PM₁₀, PM₂.₅) and GHG (SO₂, NO₂, O₃, CO) concentration data, along with meteorological variables, in the solar power generation forecasting model to analyse their impact on the model’s accuracy. The results of this study suggest that improving the accuracy of solar power generation forecasts can mitigate the uncertainty of renewable energy, making the results applicable to the development of future renewable energy generation forecasting systems. Additionally, it is anticipated that these advancements will contribute to the stable and efficient operation of the power grid to some extent.

PV power forecasting modelling approaches can be categorized into physical, statistical, artificial intelligence (AI), and hybrid approaches, the last of which involve ensemble methods [16]. Physical methods typically rely on numerical atmospheric data for weather forecasting and are primarily developed through Numerical Weather Prediction (NWP) [17]. These models demonstrate optimal performance in situations where meteorological conditions remain stable. Statistical techniques, such as general time series models, forecast a series as a linear combination of its own values at preceding time steps [18,19]. PV power data, on the other hand, display intricate time-series features with dynamic and irregular patterns. As such, it is still difficult to achieve accurate PV power forecasting using statistical techniques. AI methodologies, specifically machine learning and deep learning models, demonstrate excellent performance in handling non-linear data and are consequently the primary methods utilized in current solar power generation forecasting. This approach is favoured for its simplicity in modelling while providing high prediction power and generalization performance, leading to active research on diverse forecasting models worldwide [20,21,22]. For this reason, a machine learning-based model was used in this study to forecast the generation of solar power. Table 1 presents a brief description of the previous studies on photovoltaic power generation forecasting.

Table 1. Summary of related previous studies.

Sharadga et al. [23] compared statistical models and neural networks for forecasting solar power generation in large-scale plants. When applied to the forecasting of PV power output, the artificial neural network (ANN) demonstrates superior accuracy and computational efficiency compared to statistical models such as the autoregressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA), with bidirectional long short-term memory (Bi-LSTM) being the most accurate model in that study. Ahmed et al. [16] also demonstrated that the ANN-based model performed best for short-term solar power generation forecasting. Within the NN models, studies related to the Recurrent Neural Network (RNN) model include the following: Jung et al. [24] presented a stacked long short-term memory-RNN (LSTM-RNN) model for monthly PV power output forecasting in a new generation system for South Korea. A stacked LSTM-RNN model with memory cells that retain temporal effects during the learning process enabled the capture of temporal patterns in monthly datasets. Liu et al. [25] concluded that long short-term memory (LSTM) was better than multi-layer perceptron (MLP), regardless of the training dataset. Mellit et al. [26] compared different deep neural network (DNN) models for short-term PV power output forecasting. LSTM and gated recurrent unit (GRU) models had lower root mean square error (RMSE) values and less deviation from the mean than convolutional neural network (CNN) ensemble models. The single LSTM model achieved the best performance with an RMSE of 0.16. Li et al. [14] introduced a new method using Bi-LSTM to improve the weather forecast temporal resolution from three hours to one hour. They used a GRU-CNN model to capture both temporal and spatial correlations among solar power plant observations in nearby areas.

Additionally, Lee and Kim [27] found that in all experimental settings, the single GRU model performed better than the LSTM, ANN, and DNN-based models. Although the GRU and LSTM sometimes perform similarly, the GRU has demonstrated reliable and consistently positive outcomes regardless of the data. Kazem et al. [28] utilized a fully recurrent neural network (FRNN) and discovered that the FRNN outperformed ANN-based models and PCA in learning data patterns. Khan et al. [29] introduced the deep stacked ensemble-XGBoost (DSE-XGB) model, which combines extreme gradient boosting (XGBoost), ANN, and LSTM; it outperforms individual deep learning algorithms across different weather conditions due to its strong base learners, showing better consistency and stability. Rahman et al. [30] found that LSTM excels in short-term forecasting, with increasing errors for longer forecast periods. However, multivariate LSTM models outperform univariate ones in long-term forecasting, highlighting the effectiveness of multivariate LSTM models for improved accuracy in extended forecasts. Sarmas et al. [31] introduced four LSTM models and a hybrid meta-model for one-hour-ahead solar generation forecasting. The stacked LSTM (StackLSTM) demonstrated the highest accuracy, followed by Convolutional Long Short-Term Memory (ConvLSTM) and Bi-LSTM.

Although not shown in the table, Wang and Jia [36] introduced the light gradient boosting machine-LSTM (LightGBM-LSTM) model and compared it to XGBoost, light gradient boosting machine (LightGBM), and LSTM using real solar energy data from a Chinese solar power plant. Their findings indicate that LightGBM-LSTM outperformed the other models in prediction power. While XGBoost and LightGBM exhibited similar performance, both surpassed the LSTM in prediction power, prompting further exploration of their use in this context. Malakouti et al. [32] reported that Decision Tree, LightGBM, and Extra Tree achieved comparably high R² values close to 1 but differed in their runtime, with LightGBM having the shortest runtime. Ye et al. [33] also proposed LightGBM-XGBoost for simultaneously forecasting the PV generation of multiple power plants with different conditions. Furthermore, a wind power generation forecasting model based on LightGBM outperformed all the other models considered [37,38].

As a study of ensemble or hybrid models, Cao et al. [34] introduced a novel stacking ensemble algorithm, the LSTM-Informer model. This model leverages the strengths of LSTM and Informer models, catering to short-term and long-term time series forecasting, respectively, by extracting time-dependent information of various scales and generating meta-features. Babalhavaeji et al. [35] proposed a novel PV generation forecasting model called CNN-GRU, in which the CNN learns the spatial features and the GRU captures the temporal information associated with the data. The CNN-GRU forecasts are the closest to the real values, and CNN-LSTM achieves the second-best forecasting results. In addition, ensemble models are better than single models.

To summarize the previous studies,

(1): Machine learning algorithms often surpass traditional time series models in performance. However, the superiority of a specific machine learning model is not consistent, as it varies across different research studies.
(2): Recent studies highlight the RNN’s superior efficiency over traditional ANNs across varied conditions. In addition, RNN and CNN-based models are widely used, where the former is used to identify the temporal characteristics of the power generation data, and the latter is used to identify the spatial characteristics.
(3): Typically, ensemble and hybrid models have better prediction power than single models.
(4): Research on forecasting solar power generation often focuses on meteorological data, with limited attention to the impact of PM and GHG.

This study uses machine learning and deep learning models to forecast solar power generation. Although various models have been actively proposed in previous studies, this study aims to compare the prediction power of input variables using the models mainly utilized in existing studies. While the LSTM is regarded as one of the top-performing RNN-based algorithms, and numerous derivatives have been created, we employ the standard LSTM model due to its extensive usage in existing research. We explored LightGBM and XGBoost, which perform similarly to LSTM in forecasting, and also investigated a stacking ensemble model based on LightGBM and XGBoost. However, CNN models are not considered, because we forecast the overall average of hourly solar power generation in the Gyeongnam region of South Korea, which has fixed geographical and spatial factors.

The main purpose of this study is twofold: (1) to find the most accurate solar power generation forecasting model among LSTM, LightGBM, XGBoost, and LightGBM-XGBoost models and to verify that the prediction power of the ensemble model shows a significant improvement over the existing single model; (2) to examine how the model performed both with and without the addition of PM and GHG concentration factors with meteorological data. Hence, the grid operator can consider the optimal model based on the available data. The remainder of this study is organized as follows: Section 2 introduces the model and metrics used. Section 3 describes the data used and the forecasting results of the model. Finally, Section 4 presents the conclusions and limitations of the study.

2. Methodology

In this study, we used LightGBM, XGBoost, and LSTM as single machine learning and deep learning models and also considered a stacked ensemble model that combines LightGBM and XGBoost. The following subsections briefly describe the models used in this study and the measurement statistics used to evaluate the prediction power of the models and variables. Figure 1 presents a flowchart of the research steps.

Figure 1. Flowchart of research steps.

2.1. LightGBM

LightGBM is a distributed gradient boosting framework based on decision trees, proposed by Microsoft Research [39]. It is a powerful approach for solving regression, classification, and other machine learning problems. Its core concept is to combine N weak regression trees into a strong regressor through linear combination. The calculation formula is

$$F(x) = \sum_{n=1}^{N} f_{n}(x), \tag{1}$$

where $F(x)$ is the final output, and $f_{n}(x)$ is the output of the $n$-th weak regression tree.

With a calculation speed 10 times faster than the original gradient boosting decision tree (GBDT) method and a memory requirement of only one-third, LightGBM also excels in prediction power [40,41]. Two innovative techniques are included in the LightGBM algorithm: exclusive feature bundling (EFB), which handles a large number of data features without aggravating overfitting, and gradient-based one-side sampling (GOSS), which handles large datasets. The primary advancements in the LightGBM model are the histogram algorithm and the leaf-wise growth strategy with depth limitation. The histogram algorithm discretizes continuous data into K integers, constructing a histogram with a width of K. As values are traversed, their statistics accumulate in the histogram under these integer indices, and the optimal decision tree split points are then searched over the histogram bins. The leaf-wise strategy splits, in each iteration, only the leaf with the highest gain. This approach is particularly effective, but it can induce trees of significant depth. Constraints on tree depth and leaf count therefore reduce the model complexity, preventing overfitting.
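The histogram idea can be illustrated with a minimal sketch. It is not LightGBM's actual implementation: the equal-width binning, the squared-error gain, the toy data, and the function names `build_histogram` and `best_split` are all assumptions chosen for brevity.

```python
def build_histogram(values, gradients, n_bins):
    """Discretize a continuous feature into n_bins equal-width bins and
    accumulate the gradient sum and sample count of each bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan bin boundaries only (not raw values) and return the
    (bin index, gain) of the best split under squared-error loss:
    gain = G_L^2 / n_L + G_R^2 / n_R - G^2 / n."""
    total_g, total_n = sum(grad_sum), sum(count)
    best = (None, 0.0)
    g_left, n_left = 0.0, 0
    for b in range(len(count) - 1):
        g_left += grad_sum[b]
        n_left += count[b]
        n_right = total_n - n_left
        if n_left == 0 or n_right == 0:
            continue
        g_right = total_g - g_left
        gain = g_left**2 / n_left + g_right**2 / n_right - total_g**2 / total_n
        if gain > best[1]:
            best = (b, gain)
    return best


# Toy feature with two clearly separated groups (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hist_grad, hist_count = build_histogram(x, grads, n_bins=4)
split_bin, split_gain = best_split(hist_grad, hist_count)
```

The split search visits at most `n_bins - 1` candidate boundaries regardless of the number of samples, which is where the speed and memory savings of the histogram approach come from.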

2.2. XGBoost

XGBoost is an algorithm based on GBDT. It constructs a robust regression model by iteratively generating and combining multiple weak tree models, primarily classification and regression trees [42]. The algorithm builds a certain number of weak learners, training each on the residual errors obtained from the previous iteration. Lastly, the forecasting results of all decision trees are summed to create the final regression model. This decision tree-based approach ensures that each new learner, built on the gradient, contributes to reducing the overall model error. When forecasting PV power, a regression tree forecast can be created using XGBoost [43]. Assume that $x_{i}$ is an input feature and $y_{i}$ is the target value; the training sample dataset $D$ can then be written as follows:

$$D = \{(x_{i}, y_{i})\} \quad (i = 1, 2, \dots, n). \tag{2}$$

Then, the $i$-th sample’s forecasting function is defined as follows:

$$\hat{y}_{i} = \sum_{m=1}^{M} f_{m}(x_{i}), \quad f_{m} \in F, \tag{3}$$

where $f_{m}(x_{i})$ is the discriminant function of the $m$-th tree for the $i$-th sample, and $F$ is the set of decision tree models whose $M$ members are integrated into the robust model.

XGBoost (version 3.1.0) introduces L1 and L2 regularization terms into the objective and applies a second-order Taylor expansion to the loss function. Column sampling is supported in order to reduce computation and avoid overfitting. At the end of each iteration, XGBoost applies the learning rate (shrinkage) to the leaf node weights, which lowers the weight of each tree and leaves room for subsequent trees to improve the model [44]. XGBoost uses a level-wise (horizontal) decision tree growth strategy, while LightGBM uses a leaf-wise algorithm with depth limitations.
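To make the role of the second-order expansion and the L2 term concrete, the following sketch evaluates the standard closed-form leaf weight and split gain that result from them. It is an illustration under stated assumptions, not the library's code: only the L2 penalty (`reg_lambda`) and a per-split penalty `gamma` are included, the L1 term is omitted, and the function names and toy numbers are hypothetical.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf value w* = -G / (H + lambda) from the second-order
    approximation of the loss with an L2 penalty on the leaf weight."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return grad_sum**2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma=0.0):
    """Loss reduction obtained by splitting a parent leaf into two children."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Squared-error loss l = (y - y_hat)^2 / 2 gives g = y_hat - y and h = 1.
y = [1.0, 1.0, 3.0, 3.0]
y_hat = [0.0, 0.0, 0.0, 0.0]          # current ensemble prediction
g = [p - t for p, t in zip(y_hat, y)]  # first-order gradients
h = [1.0] * len(y)                     # second-order gradients

# Candidate split: first two samples go left, last two go right.
G_L, H_L = sum(g[:2]), sum(h[:2])
G_R, H_R = sum(g[2:]), sum(h[2:])
w_left = leaf_weight(G_L, H_L, reg_lambda=1.0)
w_right = leaf_weight(G_R, H_R, reg_lambda=1.0)
gain = split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0)
```

With `reg_lambda = 0` the leaf weights would equal the mean residual of each leaf (1 and 3); the penalty shrinks them to 2/3 and 2, which is the regularizing effect described above.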

2.3. LSTM

The RNN model known as LSTM was introduced by Hochreiter and Schmidhuber [45] and is widely used for learning and forecasting time series data. If there is a large distance between the relevant information and the point at which it is used, the gradient of a classic RNN shrinks substantially during backpropagation. The resulting considerable decline in learning performance is known as the vanishing gradient problem. LSTM addresses this issue by including a cell state in addition to the hidden state of a standard RNN [46,47]. LSTM excels in preserving and utilizing crucial information across prolonged time intervals, showcasing its proficiency in handling long-term dependencies [48]. The formulations of the LSTM structure are presented as follows:

$$
\begin{aligned}
f_{t} &= \sigma(W_{fx} x_{t} + W_{fh} h_{t-1} + b_{f}), \\
i_{t} &= \sigma(W_{ix} x_{t} + W_{ih} h_{t-1} + b_{i}), \\
g_{t} &= \tanh(W_{gx} x_{t} + W_{gh} h_{t-1} + b_{g}), \\
o_{t} &= \sigma(W_{ox} x_{t} + W_{oh} h_{t-1} + b_{o}), \\
c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot g_{t}, \\
h_{t} &= o_{t} \odot \tanh(c_{t}),
\end{aligned}
\tag{4}
$$

where $c_{t-1}$ is the memory cell state; the hidden state $h_{t-1}$ and input state $x_{t}$ interact with the LSTM gates, namely the input gate $i_{t}$, input node $g_{t}$, forget gate $f_{t}$, and output gate $o_{t}$; $\odot$ is the Hadamard product (elementwise multiplication); $\sigma$ denotes the sigmoid activation function; and $W$ and $b$ are the weight matrices and bias vectors, respectively, which are learned during training.

To learn, the network uses the input, forget, and output gates to update, preserve, or erase the prior state information from time step $t-1$. The LSTM model’s basic structure is shown in Figure 2.
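As an illustration of Equation (4), the sketch below performs one LSTM time step for a single hidden unit and a single input, so that the Hadamard products reduce to ordinary multiplication. The parameter dictionary, its key names, and the numerical values are assumptions made for the example; a practical model would use weight matrices and a deep learning library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Equation (4), scalar case.
    p maps names such as 'W_fx', 'W_fh', 'b_f' to floats."""
    f_t = sigmoid(p["W_fx"] * x_t + p["W_fh"] * h_prev + p["b_f"])    # forget gate
    i_t = sigmoid(p["W_ix"] * x_t + p["W_ih"] * h_prev + p["b_i"])    # input gate
    g_t = math.tanh(p["W_gx"] * x_t + p["W_gh"] * h_prev + p["b_g"])  # input node
    o_t = sigmoid(p["W_ox"] * x_t + p["W_oh"] * h_prev + p["b_o"])    # output gate
    c_t = f_t * c_prev + i_t * g_t     # keep part of the old cell state, add new content
    h_t = o_t * math.tanh(c_t)         # expose part of the cell state
    return h_t, c_t


# With all parameters at zero, every gate equals sigmoid(0) = 0.5 and the
# input node equals tanh(0) = 0, so the cell state is simply halved.
zero_params = {f"{kind}_{gate}{src}": 0.0
               for gate in "figo" for kind, src in (("W", "x"), ("W", "h"), ("b", ""))}
h_1, c_1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

The update of `c_t` is additive rather than a repeated matrix multiplication, which is what allows information (and gradients) to survive over long time spans.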

Figure 2. Structure of LSTM.

2.4. Stacked Ensemble

Stacking ensemble models improve the prediction power beyond the capabilities of individual models. This method involves training a meta-learner to optimally combine the forecasts of multiple base models. Initially, the different models are trained individually; then, their forecasts are used as input to the meta-learner. In two-layer stacking, the forecasts of the base models, generated by K-fold cross-validation, form a new feature set. Along with the original labels, this set is used to train the meta-learner, using a simple algorithm such as linear regression, to generate the final model. The strength of stacking lies in the meta-learning layer, which effectively identifies the strengths and weaknesses of each base model, improving performance and generalization ability [49]. This paper proposes a generally applicable stacked ensemble algorithm that uses two gradient boosting algorithms, LightGBM and XGBoost, as base models for PV generation forecasting. The meta-model is LightGBM. The stacking model is formulated as follows. Given the dataset $D = \{(x_{i}, y_{i})\}$ $(i = 1, 2, 3, \dots, n)$, the first-layer base predictors are $\{\zeta_{n}\}$, the second-layer predictor is $\theta$, and the final output takes the form

$$H(x) = \theta(\zeta_{1}(x), \zeta_{2}(x), \dots, \zeta_{n}(x)). \tag{5}$$
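The two-layer procedure behind Equation (5) can be sketched as follows. This is a simplified stand-in rather than the configuration used in this study: the base learners are a mean predictor and a one-feature least-squares line instead of LightGBM and XGBoost, the meta-learner is a single least-squares blending weight instead of LightGBM, and the folds are interleaved, whereas time-ordered folds may be preferable for time series. All class and function names are hypothetical.

```python
class MeanModel:
    """Base learner 1: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LineModel:
    """Base learner 2: least-squares line on a single feature."""
    def fit(self, X, y):
        n = len(X)
        mx, my = sum(X) / n, sum(y) / n
        sxx = sum((x - mx) ** 2 for x in X)
        sxy = sum((x - mx) * (t - my) for x, t in zip(X, y))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


def out_of_fold(model_cls, X, y, k):
    """K-fold predictions: each sample is predicted by a model
    that was fitted without that sample."""
    n = len(X)
    oof = [0.0] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = model_cls().fit([X[i] for i in train_idx],
                                [y[i] for i in train_idx])
        for i, p in zip(test_idx, model.predict([X[i] for i in test_idx])):
            oof[i] = p
    return oof


def fit_blend_weight(p1, p2, y):
    """Meta-learner: weight w minimizing sum (y - (w*p1 + (1-w)*p2))^2."""
    num = sum((t - b) * (a - b) for a, b, t in zip(p1, p2, y))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den if den else 0.5


# Hypothetical data: a noise-free linear target.
X = [float(i) for i in range(12)]
y = [2.0 * x + 1.0 for x in X]

# Layer 1: out-of-fold forecasts become the meta-features.
oof_line = out_of_fold(LineModel, X, y, k=3)
oof_mean = out_of_fold(MeanModel, X, y, k=3)
# Layer 2: the meta-learner is trained on the meta-features and original labels.
w = fit_blend_weight(oof_line, oof_mean, y)

# Final model: base learners refitted on all data, combined by the meta-learner.
line, mean = LineModel().fit(X, y), MeanModel().fit(X, y)


def stacked_predict(X_new):
    return [w * a + (1.0 - w) * b
            for a, b in zip(line.predict(X_new), mean.predict(X_new))]


forecast = stacked_predict([20.0])[0]
```

Training the meta-learner on out-of-fold forecasts, rather than on in-sample forecasts, keeps it from rewarding a base model merely for having memorized the training data.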

2.5. Evaluation of the Prediction Power

The ensemble scheme and base models are evaluated using four error metrics commonly employed in the literature to estimate model accuracy: the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and mean squared error (MSE). Even though the mean absolute percentage error (MAPE) is a popular performance evaluation metric, the MAPE exhibits asymmetry, where errors above the original value incur a larger absolute percentage error than those below [50]. In addition, the MAE shows higher sensitivity to the power calculation method compared to the RMSE, likely because the MSE is more responsive to elevated errors from irradiance forecasts and less influenced by the power conversion method [51]. Therefore, this study evaluates the prediction power of each model and provides both the RMSE and MAE. The formulations of the RMSE, MAE, and R² are as follows:

$$RMSE = \sqrt{\frac{1}{N} \sum_{t=1}^{N} (\hat{y}_{t} - y_{t})^{2}}, \tag{6}$$

$$MAE = \frac{1}{N} \sum_{t=1}^{N} |y_{t} - \hat{y}_{t}|, \tag{7}$$

$$R^{2} = 1 - \frac{\sum_{t=1}^{N} (y_{t} - \hat{y}_{t})^{2}}{\sum_{t=1}^{N} (y_{t} - \bar{y})^{2}}, \tag{8}$$

where $y_{t}$ is the actual value, $\hat{y}_{t}$ is the predicted value, $\bar{y}$ is the mean of the actual values, and $N$ is the sample size. The MSE is the squared value of the RMSE.
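Equations (6) to (8) translate directly into code. The following is a minimal sketch using only the standard library; the function names and the four-point toy series are assumptions for illustration and are not data from this study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error, Equation (6)."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error, Equation (7)."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r2(y_true, y_pred):
    """Coefficient of determination, Equation (8)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and forecasted values.
actual = [3.0, 5.0, 7.0, 9.0]
forecast = [2.0, 5.0, 8.0, 9.0]

rmse_value = rmse(actual, forecast)   # sqrt(2 / 4)
mae_value = mae(actual, forecast)     # 2 / 4
r2_value = r2(actual, forecast)       # 1 - 2 / 20
mse_value = rmse_value ** 2           # the MSE is the square of the RMSE
```

Because the RMSE squares each error before averaging, it is never smaller than the MAE on the same forecasts, and the gap between the two widens when a few errors are much larger than the rest.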

3. Data and Results

3.1. Dataset

To predict solar power generation, the PV power plants to be analysed must first be selected. This study selected data from the power plants located in Gyeongsangnam-do (Gyeongnam), which is the hinterland area of the new port of South Korea. The hourly average power generation data of all PV plants located in the region were collected from the Korea Power Exchange (KPX), and the meteorological data were collected from the National Climate Data Centre. Because the meteorological data consist of hourly observations from individual stations located in Gyeongnam, the observations of all stations in the same time period were averaged and reconstructed into hourly average meteorological data for the entire Gyeongnam region. Figure 3 shows the correlation heatmap of the collected variables (11 in total, including PV, temperature, precipitation, wind speed, and wind direction). Some variables have a correlation of more than 0.6, which is likely to cause multicollinearity problems. When independent variables in a regression analysis are highly correlated, multicollinearity can reduce the predictive performance of the model and cause overfitting. It can also distort the importance of variables; therefore, this study used Elastic Net for feature selection (dimensionality reduction) [52].
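The two preprocessing steps described above, averaging station observations into a regional hourly series and screening for highly correlated variables, can be sketched as follows. The record layout, station identifiers, column names, and toy values are hypothetical; only the 0.6 threshold comes from the text.

```python
import math
from collections import defaultdict


def hourly_regional_mean(records):
    """records: iterable of (timestamp, station_id, value).
    Returns {timestamp: mean over all stations reporting at that hour}."""
    buckets = defaultdict(list)
    for timestamp, _station, value in records:
        buckets[timestamp].append(value)
    return {ts: sum(vals) / len(vals) for ts, vals in sorted(buckets.items())}


def pearson(a, b):
    """Pearson correlation coefficient of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)


def correlated_pairs(columns, threshold=0.6):
    """List variable pairs whose absolute correlation exceeds the threshold."""
    names = sorted(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) > threshold:
                flagged.append((first, second, r))
    return flagged


# Step 1: average station observations per hour (hypothetical temperatures).
observations = [
    ("2020-01-01 00:00", "station_A", 10.0),
    ("2020-01-01 00:00", "station_B", 14.0),
    ("2020-01-01 01:00", "station_A", 11.0),
]
regional = hourly_regional_mean(observations)

# Step 2: flag strongly correlated variable pairs (hypothetical series).
features = {
    "temperature": [1.0, 2.0, 3.0, 4.0],
    "dew_point": [2.0, 4.0, 6.0, 8.0],
    "wind_speed": [3.0, 1.0, 4.0, 2.0],
}
flagged = correlated_pairs(features)
```

Pairs flagged in this way are candidates for multicollinearity; the screening itself does not decide which variable to drop, which is the role of the regularized selection step.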

Figure 3. Correlation matrix plot of meteorological features.

3.2. Elastic Net

The Elastic Net is a regularization technique that combines the features of Ridge Regression and Lasso Regression, making it particularly useful for high-dimensional datasets where multicollinearity is a concern. Elastic Net addresses this issue by incorporating both L1 and L2 penalties, thus performing variable selection and regularization simultaneously. This method offers a flexible and robust approach to model building, especially when dealing with complex datasets. The objective function of the Elastic Net is defined as follows:

$$\min_{\beta_{0}, \beta} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left(y_{i} - (\beta_{0} + x_{i}^{T} \beta)\right)^{2} + \lambda \left[ \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_{j}^{2} + \alpha \sum_{j=1}^{p} |\beta_{j}| \right] \right\}, \tag{9}$$

where $N$ is the number of samples, $y_{i}$ represents the response variable, and $x_{i}$ denotes the explanatory variables; $p$ is the number of input features; $\beta_{0}$ is the intercept, and $\beta_{j}$ represents the regression coefficients; $\lambda$ is the parameter that adjusts the regularization; and $\alpha$ is the parameter that balances the contribution of the L1 penalty (lasso) and the L2 penalty (ridge) and takes a value between 0 and 1 [53,54].
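The objective in Equation (9) can be evaluated directly for given coefficients, which makes the roles of the two parameters easy to inspect. The sketch below assumes that λ scales both penalty terms, as its description as the regularization parameter implies; the function name and the toy data are hypothetical, and no optimization is performed.

```python
def elastic_net_objective(X, y, beta0, beta, lam, alpha):
    """Value of the Elastic Net objective for fixed coefficients.
    X: list of feature rows, y: responses, beta0: intercept,
    beta: coefficient list, lam: overall penalty strength,
    alpha: L1/L2 mixing parameter in [0, 1]."""
    n = len(y)
    rss = sum(
        (yi - (beta0 + sum(xij * bj for xij, bj in zip(xi, beta)))) ** 2
        for xi, yi in zip(X, y)
    )
    l2 = sum(b * b for b in beta)     # ridge part
    l1 = sum(abs(b) for b in beta)    # lasso part
    return rss / (2 * n) + lam * ((1 - alpha) / 2 * l2 + alpha * l1)


# Hypothetical two-sample, two-feature example.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
beta = [1.0, 1.0]

mixed = elastic_net_objective(X, y, 0.0, beta, lam=0.1, alpha=0.5)
lasso_only = elastic_net_objective(X, y, 0.0, beta, lam=0.1, alpha=1.0)
ridge_only = elastic_net_objective(X, y, 0.0, beta, lam=0.1, alpha=0.0)
```

Parameter names differ between libraries. In scikit-learn's `ElasticNet`, for instance, the overall strength is called `alpha` and the mixing parameter is called `l1_ratio`, so reported values should be read with the naming convention of the software in mind.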

In short, $\alpha$ balances the L1 and L2 penalties, while $\lambda$ controls the overall strength of regularization. A higher $\alpha$ favors Lasso’s variable selection, and a higher $\lambda$ increases the regularization, affecting all coefficients. In the end, two meteorological variables, precipitation and dewpoint, were excluded at the $\alpha = 0.001$ level.

In this study, hourly data for 3 years and 7 months, from 1 January 2020 at 00:00 to 31 July 2023 at 23:00, were used as the training data, and the test period for evaluating the performance of each model was set from 1 August 2023 at 00:00 to 31 August 2023 at 23:00. The data collected in the above procedure are summarised in Table 2.

Table 2. Table of variables.

The data of particulate matter and GHG concentrations were obtained from the national real-time air pollution level and observations by measurement stations of Air Korea, Korea’s environmental agency. The purpose of this study is to analyse the impact of particulate matter and GHGs in establishing a PV generation forecasting model. Thus, there are four scenarios of datasets to test the validity of the variables: (1) meteorological variables; (2) meteorological variables and particulate matter concentrations; (3) meteorological variables and GHG concentrations; and (4) total variables (time variables are included in all datasets). In addition, all analyses in this study were performed using Python 3.11.
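The four dataset scenarios and the chronological train/test split can be written down compactly. In the sketch below, the pollutant names and the date boundaries are taken from the text, whereas the time-variable names and the listed weather columns are hypothetical placeholders (the full variable list is given in Table 2).

```python
from datetime import datetime, timedelta

# Column groups. TIME and WEATHER names are illustrative placeholders.
TIME = ["hour", "month"]
WEATHER = ["temperature", "wind_speed", "wind_direction"]
PM = ["PM10", "PM2.5"]
GHG = ["SO2", "NO2", "O3", "CO"]

# Time variables are included in all four scenarios.
SCENARIOS = {
    "Weather": TIME + WEATHER,
    "Weather + PM": TIME + WEATHER + PM,
    "Weather + GHG": TIME + WEATHER + GHG,
    "Weather + PM/GHG": TIME + WEATHER + PM + GHG,
}

TRAIN_START = datetime(2020, 1, 1, 0)
TRAIN_END = datetime(2023, 7, 31, 23)
TEST_START = datetime(2023, 8, 1, 0)
TEST_END = datetime(2023, 8, 31, 23)


def hourly_range(start, end):
    """All hourly timestamps from start to end, both inclusive."""
    n_hours = int((end - start).total_seconds() // 3600) + 1
    return [start + timedelta(hours=i) for i in range(n_hours)]


train_hours = hourly_range(TRAIN_START, TRAIN_END)
test_hours = hourly_range(TEST_START, TEST_END)
n_train, n_test = len(train_hours), len(test_hours)
```

Under these boundaries the training period spans 31,392 hourly time stamps and the test period 744, before any rows are removed for missing values. Splitting by date rather than at random keeps every test observation strictly later than the training data.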

3.3. Comparison of Solar Power Forecasting Model Performance

In this study, various combinations of predictor variables were analysed to determine which variables help to improve the forecasting of solar power generation. The parameters of the models were optimized on the training data using grid search and Bayesian optimization. To compare the prediction power of the models, the performance evaluation factors of each model were calculated; the results are presented in Table 3.

Among the single models, LightGBM has the lowest RMSE across all datasets, with its best performance on the ‘Weather + PM/GHG’ dataset (RMSE of 11.2146). This suggests that LightGBM is able to maintain high predictive power even with complex data structures. In comparison, the XGBoost model had an average RMSE about 1.08 higher than that of LightGBM, but a comparable explanatory power, with an R-squared value of 0.9803. The LSTM had the highest error, with a maximum RMSE of 19.6150 and MAE of 14.8621, but a relatively high R-squared value, indicating that it captured the variability in the data well.

The stacked ensemble model consistently outperformed the single models on all datasets. In particular, on the ‘Weather’ dataset, the error reduction effect of the ensemble was clearly observed: the stacked ensemble achieved RMSE values 3.76% and 6.26% lower than those of LightGBM and XGBoost, respectively. In addition, the ‘Weather + PM/GHG’ dataset showed the lowest error, with an RMSE of 10.7245, an MAE of 6.4501, and an R-squared value of 0.9862, demonstrating the stability and predictive accuracy of the model. On the ‘Weather + PM’ dataset, the stacked ensemble achieved an RMSE of 10.9826, about 0.52% lower than on the ‘Weather’ dataset, indicating that the PM variable has a positive but small impact on the model’s performance. The performance improvement of the stacked ensemble on the ‘Weather + GHG’ dataset was insignificant: the LightGBM and XGBoost models have a 0.33% and 4.79% lower RMSE, respectively, compared to the ‘Weather’ dataset, while the stacked ensemble has a 0.37% lower RMSE. This suggests that the GHG variable is not a significant factor in the solar power forecast.

Overall, on the ‘Weather + PM/GHG’ dataset, the stacked ensemble achieved a 2.87% reduction in RMSE relative to the ‘Weather’ dataset, whereas LightGBM achieved a 0.31% reduction. A similar pattern was observed for the MAE metric, for which the stacked ensemble showed an improvement of 3.12%, suggesting that ensemble models can better integrate the effects of complex environmental variables than single models to improve prediction performance. Thus, including the particulate matter and GHG concentration variables improved the performance of some models, whereas the XGBoost and LSTM models tended to perform worse. This indicates that certain models can effectively learn the impact of environmental variables to improve their prediction performance, while others can be degraded by unnecessary information. In addition, the models that included GHG concentrations showed average RMSE differences similar to those of the models that did not include the PM variable, suggesting that PM and GHG concentrations do not have a significant impact on solar power generation forecasting.
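The relative reductions quoted in this section are computed as (baseline − candidate) / baseline. The short sketch below reproduces one of them from the stacked ensemble RMSE values reported in the text (11.0401 on ‘Weather’ and 10.9826 on ‘Weather + PM’); the function and variable names are hypothetical.

```python
def pct_reduction(baseline, candidate):
    """Relative reduction of an error metric, in percent of the baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Stacked ensemble RMSE values reported in the text.
rmse_weather = 11.0401
rmse_weather_pm = 10.9826

reduction_pm = pct_reduction(rmse_weather, rmse_weather_pm)
reduction_pm_rounded = round(reduction_pm, 2)
```

The result rounds to 0.52%, matching the figure quoted for the ‘Weather + PM’ dataset. Percentages recomputed from rounded table entries may differ from reported ones in the last digit.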

Table 3. Performance evaluation factors of each model.

3.4. Comparison of Detailed Structure and Performance of Stacked Ensemble Models

One of the key objectives of this study is to determine the optimal model by analysing the structure and forecasting performance of the stacked ensemble models based on different datasets. An ensemble model combines the forecasting results of multiple base models to derive a final forecast; this study used LightGBM and XGBoost as the base models and LightGBM as the meta-model. The detailed structure of the best model for each variable combination in the stacked ensemble model is shown in Table 4. The ratios in parentheses in this table represent the relative weight contributions of each base model estimated by the meta-learner in the stacked ensemble. In addition, the actual and forecasted values for the evaluation data of the PV power generation forecasting model for each combination of variables in the stacked ensemble model are shown in Figure 4. The horizontal axis represents the time index of the test data, and the vertical axis represents the PV power generation. Here, the blue line represents the actual value, the orange line represents the forecasted value, and the vertical gap between the blue and orange lines at a given time index represents the forecasting error of the PV power generation.

Table 4. Detailed structure of the stacked ensemble model.

Figure 4. Graph of the stacked ensemble model’s PV power forecasting: (a) Weather; (b) Weather + PM; (c) Weather + GHG; (d) Weather + PM + GHG.

Using the ‘Weather’ dataset, LightGBM has a weight of 83.81% and XGBoost has a weight of 16.19%. The RMSE is 11.0401 and the MAE is 6.6573. The larger LightGBM weight is consistent with LightGBM providing better predictive performance than XGBoost as a single model. For the ‘Weather + PM’ dataset, the contribution of LightGBM decreased to 75.93%, while the contribution of XGBoost increased to 24.07%. This suggests that the particulate matter variable is a better fit for the XGBoost model within the ensemble; with an RMSE of 10.9826 and an MAE of 6.5692, the overall model performance improved compared to weather variables alone. Conversely, in the ‘Weather + GHG’ dataset, the proportion of LightGBM increased to 86.90%, suggesting that the GHG variable enhanced the predictive ability of LightGBM.

When using the comprehensive ‘Weather + PM/GHG’ dataset, the LightGBM share is 88.40%, the highest among the four stacked ensemble models. This stacked ensemble is also the best-performing one, with the lowest RMSE (10.7245) and MAE (6.4501). The fact that the stacked ensemble with the largest LightGBM weight performed best suggests that LightGBM played a key role in capturing the complex interactions in the multivariate data. Because LightGBM is a decision tree-based model with high flexibility in feature importance and data partitioning, it predicted more accurately for environmental data with complex interactions and non-linear patterns, as reflected in the improved RMSE and MAE.

XGBoost can help mitigate the overfitting that can occur in LightGBM and contributes to the generalisation ability of the overall model by adding diversity. Thus, the results of this study suggest that a stacked ensemble of LightGBM and XGBoost provides robust predictive performance on multivariate datasets of increased complexity that include a variety of environmental variables.
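The point about diversity can be illustrated with a fixed-weight blend of two sets of predictions. This is a deliberate simplification and not the study's method: the meta-learner here is LightGBM, so the real combination need not be linear. The toy values below are hypothetical, and the weights 0.8381 and 0.1619 are borrowed from the ‘Weather’ row of Table 4 purely as illustrative blend weights.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def blend(pred_a, pred_b, weight_a):
    """Fixed-weight linear blend: weight_a * pred_a + (1 - weight_a) * pred_b."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical toy data (not from the study): the two base models make
# errors in opposite directions, which is what "diversity" refers to here.
actual = [5.0, 20.0, 55.0, 80.0, 40.0]
pred_lgbm = [7.0, 18.0, 58.0, 77.0, 43.0]  # stronger base model
pred_xgb = [1.0, 26.0, 49.0, 88.0, 34.0]   # weaker base model, opposite-sign errors

# Illustrative weights taken from the 'Weather' row of Table 4.
blended = blend(pred_lgbm, pred_xgb, weight_a=0.8381)

for name, pred in [("LightGBM", pred_lgbm), ("XGBoost", pred_xgb), ("Blend", blended)]:
    print(f"{name}: RMSE={rmse(actual, pred):.3f}, MAE={mae(actual, pred):.3f}")
```

In this toy case the blend has a lower RMSE than either input because the opposite-sign errors partly cancel, even though one input is clearly weaker on its own. When the base models' errors are strongly positively correlated, the benefit shrinks.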

3.5. Discussion

As shown in Figure 5, prior to 9 August 2023, South Korea was experiencing a heat wave and persistent ultra-tropical nights, which led to a spike in cooling demand. Then, on 9 and 10 August, all heat warnings were lifted as the country entered the path of Typhoon Khanun, which temporarily reduced both cooling demand and solar radiation as most areas recorded temperatures below 30 °C. As a result, PV power generation was relatively low on 9 and 10 August. However, after the typhoon passed, temperatures in South Korea soared to a high of 36.8 °C on 11 August, and heat warnings and advisories were issued nationwide. On 13 August, the Korea Meteorological Administration reported that 11.7 heat wave days had occurred, exceeding the annual average of 11.0 days. As such, Korea’s summer season is characterised by a series of typhoons and heat waves, and proactive extreme weather forecasting is crucial to ensure a stable power supply during peak hours and surging electricity demand [55].

Figure 5. Daily average trend of PV generation and temperature in Gyeongnam (August 2023).

An example of a power outage caused by an extreme heat wave is the 2020 California blackout [56]. In August 2020, the U.S. state of California experienced temperatures of over 38 °C, which led to a surge in cooling demand. The power system was overloaded, and rotating blackouts were implemented to prevent a large-scale blackout. Rotating blackouts are short, controlled power outages mandated by the California Independent System Operator (CAISO) that mitigate spikes in electricity demand by sequentially shutting off power during times of peak usage. The measure seeks to protect the stability of the statewide electricity system by easing the load on the supply during peak usage, thereby preventing wider and longer outages. Solar output falls rapidly around sunset, while cooling demand spikes after 19:00 due to the tropical night phenomenon, which makes a solar-heavy system particularly vulnerable in the evening. California normally imports power from the neighbouring states of Nevada and Arizona, but in 2020 those states also suffered from heat waves and were unable to export power. California’s power shortage was thus attributed to a failure to forecast the surge in electricity demand caused by the heat wave and to a lack of reserves due to radical energy transition policies. Therefore, proactive extreme weather forecasting is crucial, as is improving the accuracy of solar power generation forecasts to cope with power peaks. One option is to increase the share of solar power in the generation mix, since solar generation rises with the increased insolation during heat waves [57].

While 25 °C is the optimum temperature for solar panels to generate maximum power [58], solar power can still generate enough electricity during heat waves (daytime highs of 33 °C or higher). For example, in South Korea in August 2023, solar power met 20.0% of the country’s electricity demand during a sustained heat wave. On 3 August, South Korea experienced a surge in electricity demand due to the heat wave, with some areas experiencing ultra-tropical nighttime temperatures of over 30 °C. From 0:00 to 13:00, solar power generation was 17,843 MW, accounting for 20% of the total actual demand (89,213 MW). From 13:00 to 14:00, the total demand was 91,718 MW and solar power generation was 17,594 MW (19.2%), according to the Korea Power Exchange (KPX). Increased solar power generation during summer peak hours means that additional power can be supplied by solar instead of LNG, which emits greenhouse gases. In addition, until 2016, the peak hours for summer electricity consumption in South Korea were 14:00–15:00, which has shifted to 16:00–17:00 since 2021 (Figure 6). As of July 2021, the total solar capacity in South Korea was about 20.3 GW, comprising solar power traded on the electricity market (5.1 GW), solar power generated through power purchase agreements (PPAs) with KEPCO (11.5 GW), and solar power for home use (3.7 GW). Of these, only the 5.1 GW (about 25%) registered in the electricity market is metered in real time. The remaining 75% is unmetered and is only accounted for in demand forecasts. Therefore, the shift in peak demand to the late afternoon can be attributed to unmetered solar covering the midday demand, and it can be concluded that solar power is making a significant contribution to demand reduction and supply capacity.
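The shares quoted in this paragraph follow directly from the reported figures. The short check below reproduces them; the helper `share_pct` and the dictionary keys are hypothetical names introduced only for this illustration, and all numbers are those given in the text.

```python
def share_pct(part: float, total: float) -> float:
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


# Solar generation and total demand on 3 August 2023 (MW), as quoted in the text.
first_interval = share_pct(17_843, 89_213)
second_interval = share_pct(17_594, 91_718)

# Solar capacity as of July 2021 (GW), by category, as quoted in the text.
capacity_gw = {"market": 5.1, "ppa": 11.5, "home_use": 3.7}
total_gw = sum(capacity_gw.values())
metered_share = share_pct(capacity_gw["market"], total_gw)
unmetered_share = 100.0 - metered_share

print(f"Solar share, first interval:  {first_interval:.1f}%")
print(f"Solar share, 13:00 to 14:00:  {second_interval:.1f}%")
print(f"Total capacity: {total_gw:.1f} GW")
print(f"Metered: {metered_share:.1f}%, unmetered: {unmetered_share:.1f}%")
```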

Figure 6. Actual electricity demand and market demand by time of day.

4. Conclusions

In this study, a machine learning-based solar power generation prediction model was established by considering the effects of meteorological variables, PM, and greenhouse gas concentrations together. With the LightGBM-XGBoost ensemble model, the RMSE is reduced, and considering the combined meteorological, PM, and GHG variables helps to improve prediction performance. However, XGBoost and LSTM show the opposite result, and the difference in prediction performance is not significant; it is therefore reasonable to predict solar power generation by considering only weather variables. The significance of this study is that it considers the concentrations of PM and greenhouse gases together in a machine learning-based solar power generation prediction model using meteorological data. In addition, this study examined combinations of meteorological variables to obtain the optimal power generation prediction model; among the meteorological variables, precipitation and dew point were found to reduce prediction performance. These results are expected to be useful in the development and application of other algorithms for solar power generation prediction in the future. Based on the conclusions drawn in this study, it is believed that establishing a pre-prediction model for solar power generation can improve the accuracy of solar power generation prediction and contribute, to some extent, to the stabilisation of the domestic power system.

However, this study has some limitations in the process of establishing a solar power generation forecasting model. First, this study is based on Korea, a country with four distinct seasons. Solar power generation varies significantly between summer and winter, and the importance of the variables changes dynamically [59]. Therefore, although this study observed that the ensemble model that considers all variables outperforms the single models, it has not been shown that the model has the best predictive power in countries where temperature or weather changes are relatively small over the course of a year. It is therefore inappropriate to apply the results of this study to all countries, and further comparative analyses using data from each country are needed. Second, in assessing the impact of fine particulate matter and atmospheric GHG concentrations on solar power generation, this study did not consider the correlation and multicollinearity among these variables. However, there are studies that show a high correlation between PM₁₀ and PM₂.₅ [60] and a significant correlation between PM₂.₅ and O₃ [61]. Therefore, a further variable selection process is needed to establish an optimal solar power generation prediction model.
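A first screening for the multicollinearity mentioned above could use the Pearson correlation and, for a pair of predictors, the variance inflation factor VIF = 1 / (1 − r²). The sketch below is an assumption about how such a check might look, not part of the study; the PM₁₀ and PM₂.₅ values are synthetic and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def vif_two_predictors(r: float) -> float:
    """VIF when exactly two predictors are involved: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r * r)


# Synthetic hourly concentrations (hypothetical values, not measured data).
pm10 = [38.0, 45.0, 52.0, 61.0, 70.0, 33.0]
pm25 = [18.0, 24.0, 27.0, 35.0, 41.0, 15.0]

r = pearson(pm10, pm25)
print(f"r(PM10, PM2.5) = {r:.3f}, VIF = {vif_two_predictors(r):.1f}")
```

A common rule of thumb treats a VIF above roughly 10 as a sign of problematic multicollinearity. With more than two predictors, the VIF of each variable would instead be obtained by regressing it on all the others.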

Author Contributions

Conceptualization, J.L.; Methodology, J.A.; Validation, J.L.; Formal analysis, J.A. and J.L.; Investigation, J.L.; Resources, J.A. and J.L.; Data curation, J.A.; Writing—original draft, J.A. and J.L.; Writing—review & editing, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the ‘Junior Faculty Research Support Grant’ at Changwon National University in 2024 and the 5th Educational Training Program for the Shipping, Port, and Logistics from the Ministry of Ocean and Fisheries.

Data Availability Statement

Data are available on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Newell, R.; Raimi, D.; Villanueva, S.; Prest, B. Global Energy Outlook 2021: Pathways from Paris; Resources for the Future: Washington, DC, USA, 2021.
2. Bhattarai, U.; Maraseni, T.; Apan, A. Assay of Renewable Energy Transition: A Systematic Literature Review. Sci. Total Environ. 2022, 833, 155159.
3. Oh, H.; Hong, I.; Oh, I. South Korea’s 2050 Carbon Neutrality Policy. East Asian Policy 2021, 13, 33–46.
4. Moon, S.; Kim, Y.; Kim, M.; Lee, J. Policy Designs to Increase Public and Local Acceptance for Energy Transition in South Korea. Energy Policy 2023, 182, 113736.
5. Güney, T. Solar Energy, Governance and CO2 Emissions. Renew. Energy 2022, 184, 791–798.
6. Jäger-Waldau, A. Snapshot of Photovoltaics—May 2023. EPJ Photovolt. 2023, 14, 23.
7. Notton, G.; Nivet, M.-L.; Voyant, C.; Paoli, C.; Darras, C.; Motte, F.; Fouilloy, A. Intermittent and Stochastic Character of Renewable Energy Sources: Consequences, Cost of Intermittence and Benefit of Forecasting. Renew. Sustain. Energy Rev. 2018, 87, 96–105.
8. Kawabe, K.; Tanaka, K. Impact of Dynamic Behavior of Photovoltaic Power Generation Systems on Short-Term Voltage Stability. IEEE Trans. Power Syst. 2015, 30, 3416–3424.
9. van der Meer, D.W.; Shepero, M.; Svensson, A.; Widén, J.; Munkhammar, J. Probabilistic Forecasting of Electricity Consumption, Photovoltaic Power Generation and Net Demand of an Individual Building Using Gaussian Processes. Appl. Energy 2018, 213, 195–207.
10. Li, X.; Wagner, F.; Peng, W.; Yang, J.; Mauzerall, D.L. Reduction of Solar Photovoltaic Resources Due to Air Pollution in China. Proc. Natl. Acad. Sci. USA 2017, 114, 11867–11872.
11. World Health Organization. WHO Ambient Air Quality Database, 2022 Update: Status Report; World Health Organization: Geneva, Switzerland, 2023; ISBN 978-92-4-004769-3.
12. Son, J.; Jeong, S.; Park, H.; Park, C.-E. The Effect of Particulate Matter on Solar Photovoltaic Power Generation over the Republic of Korea. Environ. Res. Lett. 2020, 15, 084004.
13. Salamova, A.S.; Kantemirova, M.; Statsenko, E. Dynamics and Accounting of GHG Emissions in the World. BIO Web Conf. 2023, 63, 06011.
14. Li, H.; Ren, Z.; Xu, Y.; Li, W.; Hu, B. A Multi-Data Driven Hybrid Learning Method for Weekly Photovoltaic Power Scenario Forecast. IEEE Trans. Sustain. Energy 2022, 13, 91–100.
15. Gao, Z.; Zhao, J.; Wang, C.; Wang, Y.; Shang, M.; Zhang, Z.; Chen, F.; Chu, Q. A Six-Year Record of Greenhouse Gas Emissions in Different Growth Stages of Summer Maize Influenced by Irrigation and Nitrogen Management. Field Crops Res. 2023, 290, 108744.
16. Ahmed, R.; Sreeram, V.; Mishra, Y.; Arif, M.D. A Review and Evaluation of the State-of-the-Art in PV Solar Power Forecasting: Techniques and Optimization. Renew. Sustain. Energy Rev. 2020, 124, 109792.
17. Fjelkestam Frederiksen, C.A.; Cai, Z. Novel Machine Learning Approach for Solar Photovoltaic Energy Output Forecast Using Extra-Terrestrial Solar Irradiance. Appl. Energy 2022, 306, 118152.
18. Zang, H.; Xu, R.; Cheng, L.; Ding, T.; Liu, L.; Wei, Z.; Sun, G. Residential Load Forecasting Based on LSTM Fusing Self-Attention Mechanism with Pooling. Energy 2021, 229, 120682.
19. Keles, D.; Scelle, J.; Paraschiv, F.; Fichtner, W. Extended Forecast Methods for Day-Ahead Electricity Spot Prices Applying Artificial Neural Networks. Appl. Energy 2016, 162, 218–230.
20. Obando, E.D.; Carvajal, S.X.; Agudelo, J.P. Solar Radiation Prediction Using Machine Learning Techniques: A Review. IEEE Lat. Am. Trans. 2019, 17, 684–697.
21. Lima, M.A.F.; Carvalho, P.C.; Fernández-Ramírez, L.M.; Braga, A.P. Improving Solar Forecasting Using Deep Learning and Portfolio Theory Integration. Energy 2020, 195, 117016.
22. Ying, C.; Wang, W.; Yu, J.; Li, Q.; Yu, D.; Liu, J. Deep Learning for Renewable Energy Forecasting: A Taxonomy, and Systematic Literature Review. J. Clean. Prod. 2023, 384, 135414.
23. Sharadga, H.; Hajimirza, S.; Balog, R.S. Time Series Forecasting of Solar Power Generation for Large-Scale Photovoltaic Plants. Renew. Energy 2020, 150, 797–807.
24. Jung, Y.; Jung, J.; Kim, B.; Han, S. Long Short-Term Memory Recurrent Neural Network for Modeling Temporal Patterns in Long-Term Power Forecasting for Solar PV Facilities: Case Study of South Korea. J. Clean. Prod. 2020, 250, 119476.
25. Liu, C.-H.; Gu, J.-C.; Yang, M.-T. A Simplified LSTM Neural Networks for One Day-Ahead Solar Power Forecasting. IEEE Access 2021, 9, 17174–17195.
26. Mellit, A.; Pavan, A.M.; Lughi, V. Deep Learning Neural Networks for Short-Term Photovoltaic Power Forecasting. Renew. Energy 2021, 172, 276–288.
27. Lee, D.; Kim, K. PV Power Prediction in a Peak Zone Using Recurrent Neural Networks in the Absence of Future Meteorological Information. Renew. Energy 2021, 173, 1098–1110.
28. Kazem, H.A.; Yousif, J.H.; Chaichan, M.T.; Al-Waeli, A.H.; Sopian, K. Long-Term Power Forecasting Using FRNN and PCA Models for Calculating Output Parameters in Solar Photovoltaic Generation. Heliyon 2022, 8, e08803.
29. Khan, W.; Walker, S.; Zeiler, W. Improved Solar Photovoltaic Energy Generation Forecast Using Deep Learning-Based Ensemble Stacking Approach. Energy 2022, 240, 122812.
30. Rahman, N.H.A.; Hussin, M.Z.; Sulaiman, S.I.; Hairuddin, M.A.; Saat, E.H.M. Univariate and Multivariate Short-Term Solar Power Forecasting of 25MWac Pasir Gudang Utility-Scale Photovoltaic System Using LSTM Approach. Energy Rep. 2023, 9, 387–393.
31. Sarmas, E.; Spiliotis, E.; Stamatopoulos, E.; Marinakis, V.; Doukas, H. Short-Term Photovoltaic Power Forecasting Using Meta-Learning and Numerical Weather Prediction Independent Long Short-Term Memory Models. Renew. Energy 2023, 216, 118997.
32. Malakouti, S.M.; Menhaj, M.B.; Suratgar, A.A. The Usage of 10-Fold Cross-Validation and Grid Search to Enhance ML Methods Performance in Solar Farm Power Generation Prediction. Clean. Eng. Technol. 2023, 15, 100664.
33. Ye, J.; Zhao, B.; Deng, H. Photovoltaic Power Prediction Model Using Pre-Train and Fine-Tune Paradigm Based on LightGBM and XGBoost. Procedia Comput. Sci. 2023, 224, 407–412.
34. Cao, Y.; Liu, G.; Luo, D.; Bavirisetti, D.P.; Xiao, G. Multi-Timescale Photovoltaic Power Forecasting Using an Improved Stacking Ensemble Algorithm Based LSTM-Informer Model. Energy 2023, 283, 128669.
35. Babalhavaeji, A.; Radmanesh, M.; Jalili, M.; González, S.A. Photovoltaic Generation Forecasting Using Convolutional and Recurrent Neural Networks. Energy Rep. 2023, 9, 119–123.
36. Wang, Z.; Jia, L. Short-Term Photovoltaic Power Generation Prediction Based on LightGBM-LSTM Model. In Proceedings of the 2020 5th International Conference on Power and Renewable Energy (ICPRE), Shanghai, China, 12–14 September 2020; pp. 543–547.
37. Ju, Y.; Sun, G.; Chen, Q.; Zhang, M.; Zhu, H.; Rehman, M.U. A Model Combining Convolutional Neural Network and LightGBM Algorithm for Ultra-Short-Term Wind Power Forecasting. IEEE Access 2019, 7, 28309–28318.
38. Ren, J.; Yu, Z.; Gao, G.; Yu, G.; Yu, J. A CNN-LSTM-LightGBM Based Short-Term Wind Power Prediction Method Based on Attention Mechanism. Energy Rep. 2022, 8, 437–443.
39. Sun, L.; Koopialipoor, M.; Jahed Armaghani, D.; Tarinejad, R.; Tahir, M.M. Applying a Meta-Heuristic Algorithm to Predict and Optimize Compressive Strength of Concrete Samples. Eng. Comput. 2021, 37, 1133–1145.
40. Pan, Z.; Fang, S.; Wang, H. LightGBM Technique and Differential Evolution Algorithm-Based Multi-Objective Optimization Design of DS-APMM. IEEE Trans. Energy Convers. 2020, 36, 441–455.
41. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
42. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery, New York, NY, USA, 13 August 2016; pp. 785–794.
43. Zhu, J.; Li, M.; Luo, L.; Zhang, B.; Cui, M.; Yu, L. Short-Term PV Power Forecast Methodology Based on Multi-Scale Fluctuation Characteristics Extraction. Renew. Energy 2023, 208, 141–151.
44. Li, X.; Ma, L.; Chen, P.; Xu, H.; Xing, Q.; Yan, J.; Lu, S.; Fan, H.; Yang, L.; Cheng, Y. Probabilistic Solar Irradiance Forecasting Based on XGBoost. Energy Rep. 2022, 8, 1087–1095.
45. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
46. Jozefowicz, R.; Zaremba, W.; Sutskever, I. An Empirical Exploration of Recurrent Network Architectures. Proceedings of the International Conference on Machine Learning, 2015, pp. 2342–2350. Available online: https://proceedings.mlr.press/v37/jozefowicz15.html (accessed on 10 October 2025).
47. Lee, J.; Cho, Y. National-Scale Electricity Peak Load Forecasting: Traditional, Machine Learning, or Hybrid Model? Energy 2022, 239, 122366.
48. Kim, D.; Kwon, D.; Park, L.; Kim, J.; Cho, S. Multiscale LSTM-Based Deep Learning for Very-Short-Term Photovoltaic Power Generation Forecasting in Smart City Energy Management. IEEE Syst. J. 2020, 15, 346–354.
49. Zhang, Z.; Zhang, Y.; Wen, Y.; Ren, Y.; Liang, X.; Cheng, J.; Kang, M. An Improved Stacking Ensemble Learning Model for Predicting the Effect of Lattice Structure Defects on Yield Stress. Comput. Ind. 2023, 151, 103986.
50. Makridakis, S. Accuracy Measures: Theoretical and Practical Concerns. Int. J. Forecast. 1993, 9, 527–529.
51. Mayer, M.J. Benefits of Physical and Machine Learning Hybridization for Photovoltaic Power Forecasting. Renew. Sustain. Energy Rev. 2022, 168, 112772.
52. Chan, Y.H. Biostatistics 104: Correlational Analysis. Singap. Med. J. 2003, 44, 614–619.
53. Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
54. De Mol, C.; De Vito, E.; Rosasco, L. Elastic-Net Regularization in Learning Theory. J. Complex. 2009, 25, 201–230.
55. Kim, Y.; Lee, S.H.; Kim, H.W. Prediction Method of Photovoltaic Power Generation Based on LSTM Using Weather Information. J. Korean Inst. Commun. Inf. Sci. 2019, 44, 2231–2238.
56. Coleman, N.; Esmalian, A.; Lee, C.-C.; Gonzales, E.; Koirala, P.; Mostafavi, A. Energy Inequality in Climate Hazards: Empirical Evidence of Social and Spatial Disparities in Managed and Hazard-Induced Power Outages. Sustain. Cities Soc. 2023, 92, 104491.
57. Jacobson, M.Z.; von Krauland, A.-K.; Coughlin, S.J.; Palmer, F.C.; Smith, M.M. Zero Air Pollution and Zero Carbon from All Energy at Low Cost and without Blackouts in Variable Weather throughout the US with 100% Wind-Water-Solar and Storage. Renew. Energy 2022, 184, 430–442.
58. Ebhota, W.S.; Tabakov, P.Y. Influence of Photovoltaic Cell Technologies and Elevated Temperature on Photovoltaic System Performance. Ain Shams Eng. J. 2023, 14, 101984.
59. Kim, E.; Akhtar, M.S.; Yang, O.-B. Designing Solar Power Generation Output Forecasting Methods Using Time Series Algorithms. Electr. Power Syst. Res. 2023, 216, 109073.
60. Zhou, X.; Cao, Z.; Ma, Y.; Wang, L.; Wu, R.; Wang, W. Concentrations, Correlations and Chemical Species of PM2.5/PM10 Based on Published Data in China: Potential Implications for the Revised Particulate Standard. Chemosphere 2016, 144, 518–526.
61. Xie, Y.; Zhao, B.; Zhang, L.; Luo, R. Spatiotemporal Variations of PM₂.₅ and PM₁₀ Concentrations between 31 Chinese Cities and Their Relationships with SO₂, NO₂, CO and O₃. Particuology 2015, 20, 141–149.

Table 1. Summary of related previous studies.

Author(s)	Year	Main Variables	Model(s)	Description
Sharadga et al. [23]	2020	pow	BiLSTM, LSTM, LRNN, SARIMA, ARIMA, ARMA	Neural networks are more accurate than statistical models
Ahmed et al. [16]	2020	irr, ta, ws, wd, h	ANN-based models	An ensemble of ANNs is best for short-term PV forecasting. CNN excels at eliciting a model’s deep underlying non-linear input–output relationships.
Jung et al. [24]	2020	irr, ta, h, ws, pre, cc, sun, month of operation	LSTM-RNN	A stacked LSTM-RNN model is good for monthly PV power output forecasting in new photovoltaic systems.
Liu et al. [25]	2021	irr, ta, h, tpv	MLP, LSTM	LSTM outperformed MLP.
Mellit et al. [26]	2021	pow	LSTM, GRU, Bi-LSTM, Bi-GRU, CNN, CNN-LSTM, GRU-CNN	LSTM- and GRU-based models perform well, especially for very short-term forecasting.
Li et al. [14]	2021	ta, h, p, cc	GRU-CNN	GRU-CNN hybrid model enables effective learning of spatiotemporal characteristics in solar power generation data.
Lee and Kim [27]	2021	ta, h, cc, rad, pow, day of month, month of year	LSTM, GRU	The GRU-based model showed superior and more robust performance compared to ANN and LSTM.
Kazem et al. [28]	2022	ta, irr	FRNN, PCA	FRNN better simulates the experimental results curve than PCA.
Khan et al. [29]	2022	t, day, pow, irr, h, ta	ANN, LSTM, Bagging, DSE-XGB	DSE-XGB is a stacked ensemble algorithm utilizing ANN and LSTM. It shows a 10–12% improvement in the R² value compared to other single models.
Rahman et al. [30]	2023	irr, tpv, pow	LSTM	LSTM showed superior performance in multivariate settings over univariate models for longer time horizons.
Sarmas et al. [31]	2023	irr, ta	Meta, Stack-LSTM, BiLSTM, CNN-LSTM, ConvLSTM, EW, META	On average, Stack-LSTM provides the highest forecasting accuracy, and meta-learning enhances the accuracy by up to 5% over the best base model.
Malakouti et al. [32]	2023	ta, tpv, irr	DT, LightGBM, ET	The predictive power of the models is similar.
Ye et al. [33]	2023	irr	LightGBM-XGBoost, XGBoost, LightGBM	Ensemble models perform better than single models.
Cao et al. [34]	2023	Pyranometer, rad, ta, ws	LSTM-Informer, LSTM, BiLSTM, Informer, Autoformer, Stack-ETR	LSTM excels in short-term forecasting. They presented a stacking ensemble algorithm-based LSTM-Informer model for short and long-term forecasting.
Babalhavaeji et al. [35]	2023	irr, h, ta, ws	CNN-GRU, LSTM, GRU, CNN-LSTM	They pioneered CNN-GRU combination for solar power forecasting. Ensemble models are better than single models.

Notes: pow—PV power; irr—irradiance; ang—solar angles; ta—air temperature; tpv—PV temperature; h—humidity; cc—cloud coverage; pre—precipitation; ws—wind speed; wd—wind direction; p—pressure; t—time; rad—radiation. Only the most relevant features have been mentioned, but some authors included more.

Table 2. Table of variables.

Category	Subcategory	Features
Input	Time	Year, Month, Day, Hour
	Meteorological variables	Temperature, Wind speed, Wind direction, Humidity, Air pressure, Sunshine, Insolation, Cloud cover, Ground temperature
	Particulate matter	PM₁₀, PM₂.₅
	GHG	SO₂, NO₂, O₃, CO
Output	Photovoltaic power generation

Table 3. Performance evaluation factors of each model.

No.	Dataset	Model	RMSE	MAE	MSE	R²
1	Weather	Stacked ensemble	11.0401	6.6573	121.88	0.9854
1	Weather	LightGBM	11.5406	6.8913	133.1844	0.9840
1	Weather	XGBoost	11.7203	7.1504	137.3665	0.9835
1	Weather	LSTM	17.3475	12.4682	300.9371	0.9639
2	Weather + PM	Stacked ensemble	10.9826	6.5692	120.6165	0.9855
2	Weather + PM	LightGBM	11.4977	6.9889	132.1973	0.9841
2	Weather + PM	XGBoost	13.0001	7.7861	169.0039	0.9797
2	Weather + PM	LSTM	18.6931	13.5628	349.4316	0.9581
3	Weather + GHG	Stacked ensemble	11.0030	6.6166	121.0668	0.9855
3	Weather + GHG	LightGBM	11.7156	7.0447	137.2564	0.9835
3	Weather + GHG	XGBoost	12.6112	7.5726	159.0428	0.9809
3	Weather + GHG	LSTM	16.8774	12.1153	284.8481	0.9658
4	Weather + PM/GHG	Stacked ensemble	10.7245	6.4501	115.0156	0.9862
4	Weather + PM/GHG	LightGBM	11.2146	6.9358	125.7684	0.9849
4	Weather + PM/GHG	XGBoost	12.8100	7.5284	164.0954	0.9803
4	Weather + PM/GHG	LSTM	19.6150	14.8621	384.7480	0.9538

Table 4. Detailed structure of the stacked ensemble model.

Dataset	Base models: LightGBM	Base models: XGBoost	Base models: Total	RMSE	MAE
Weather	5 (83.81%)	4 (16.19%)	9	11.0401	6.6573
Weather + PM	5 (75.93%)	6 (24.07%)	11	10.9826	6.5692
Weather + GHG	6 (86.90%)	5 (13.10%)	11	11.0030	6.6166
Weather + PM/GHG	5 (88.40%)	2 (11.60%)	7	10.7245	6.4501

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
