1. Introduction
Jeju Island in South Korea has embarked on the ambitious journey of building a digital-model-based smart island emphasizing sustainable growth and environmental stewardship [
1,
2]. This initiative involves a broad spectrum of stakeholders, from technology companies to governmental organizations, collaborating to foster innovation and incorporate emerging technologies. The collaborative approach has engendered a conducive ecosystem where the exchange of ideas and expertise thrives, creating advanced solutions tailored to address the island’s unique challenges. Both public and private entities have proactively invested resources in research and development, forging strategic partnerships, and launching pilot projects. These concerted efforts aim to ensure the seamless integration of digital modeling technology into the island’s existing infrastructure. This infrastructure is envisaged to be further augmented and enhanced with the help of these digital models. In the initial phase, the digital models have been largely employed in energy management, weather information analysis, and performance optimization of renewable energy systems [
3]. Employing these models aims to improve the predictability and efficiency of energy production and distribution, optimize the use of renewable energy sources (RESs), and make informed decisions based on accurate weather forecasts. Implementing digital models reinforces Jeju Island’s commitment to sustainable growth and environmental protection.
The application of digital modeling technology in virtual power plants (VPPs) provides a pertinent example of its transformative potential. VPPs, which manage multiple distributed energy resources through a centralized control system via an energy management system (EMS), have seen marked enhancements in efficiency and stability due to integrating these digital models [
4,
5]. One of the primary features of these digital models is their ability to facilitate real-time monitoring and analysis of the virtual representations of renewable energy facilities. This capability enables operators to promptly identify and address potential issues such as a decline in power generation or equipment failure, thereby mitigating the potential for disruption in energy supply [
6,
7]. Another compelling advantage of these models is their ability to simulate outcomes based on various input factors. By doing so, they provide crucial insights that can be used to regulate energy demand and supply more effectively, thereby enhancing power stability. Additionally, these digital models facilitate the development of optimized operational strategies to improve power generation efficiency and reduce energy waste [
8]. This capability is especially relevant given the growing global emphasis on resource conservation and efficiency. Ultimately, successfully deploying digital models is a stepping stone toward achieving sustainable energy transitions. By improving the performance and efficiency of renewable energy facilities, these models contribute to reducing the reliance on fossil fuels, thereby supporting a shift towards more sustainable energy sources [
9,
10].
Photovoltaic (PV) technology is a major avenue for renewable energy, contributing substantially to sustainable growth and environmental protection [11]. PV technology converts the sun’s radiant energy into electricity with minimal carbon emissions, thus mitigating environmental impact [12]. There have been marked advancements in renewable energy technologies, particularly in PV systems. These developments, combined with improvements in solar modules, have significantly reduced installation costs and increased power generation efficiency per unit area [
13,
14]. Such attributes make PV technology an attractive choice for densely populated nations such as South Korea, where space availability for large-scale energy generation facilities is limited. However, despite these significant strides, PV power generation faces an inherent challenge—its heavy dependence on weather conditions. This dependence introduces unpredictability that can be difficult to manage [
15]. The potential variability in solar irradiance, the primary energy source for PV power generation, can impact the efficiency and output of PV power plants. Therefore, accurate prediction of future solar irradiance is essential for effectively managing PV systems and optimizing their power output. Harnessing this energy resource to its maximum potential requires sophisticated forecasting models capable of anticipating solar irradiance fluctuations and adjusting PV power plants’ operations [
16].
However, achieving such precise forecasts presents substantial challenges, largely due to the variable temporal characteristics of solar irradiance and the complex, nonlinear interdependencies between different weather factors [
16,
17]. Researchers have concentrated on improving solar irradiance prediction methodologies in response to these challenges and acknowledging the burgeoning demand for RESs.
Table 1 complements this aim, examining the state-of-the-art techniques used in renewable energy forecasting studies. The scope of the methods explored in this overview is broad and includes traditional machine learning (ML), ensemble learning (EL), deep learning (DL), and a host of hybrid models. These technological tools represent the forefront of methodological advancement, each contributing to improving the accuracy and efficiency of solar irradiance forecasts, enabling better utilization of PV power systems.
Chaibi et al. [
18] compared the performance of the light gradient boosting machine (LightGBM) in global solar irradiance forecasting to other ML models, namely, support vector machine (SVM), random forest (RF), and adaptive boosting (AdaBoost). This study also used permutation feature importance (PFI) and Shapley additive explanations (SHAP) to understand the importance of various input features and provide model interpretability. Ghimire et al. [
19] proposed a hybrid DL convolutional neural network (CNN)-stacked regression (REGST) method for predicting daily global solar irradiation. They used data from six solar energy farms in Queensland, Australia, and selected features using a metaheuristic method. The proposed REGST model was compared with other DL and conventional ML approaches, showing that the hybrid model performed significantly better in global solar irradiation predictions. Ghimire et al. [
20] also introduced a new hybrid DL model called CSVR, which combined CNN with support vector regression (SVR) for global solar irradiation predictions. The model was developed using meteorological variables from Global Climate Model and ground-based observations, and the optimal features were selected through a metaheuristic atom search optimization method. The performance of the CSVR model was compared with other DL and ML methods, demonstrating that it offered several predictive advantages over alternative models.
Santos et al. [
21] investigated ensemble-based dynamic selection methods with various base models, such as SVM, RF, and multilayer perceptron (MLP), for solar irradiance prediction. The authors showed that the dynamic selection approach could improve forecasting accuracy by combining the strengths of multiple base models and adaptively selecting the most suitable model for a given situation. Park et al. [
22] proposed a domain hybrid solar irradiation prediction model that combined LightGBM, wavelet transform, complete ensemble empirical mode decomposition with adaptive noise, and MLP. The hybrid approach improved the accuracy of solar irradiance forecasting by taking advantage of the strengths of each technique. Moon et al. [
23] proposed recurrent neural network (RNN)-based DL models, namely, long short-term memory (LSTM), bidirectional LSTM (Bi-LSTM), and gated recurrent unit (GRU), for time-series forecasting of hourly solar irradiance for Seoul, Busan, and Incheon. The proposed models were evaluated against baseline regression models and their variants. The results showed that attention mechanism-based Bi-LSTM and GRU models trained with the scaled exponential linear unit activation function provided the best multistep-ahead solar irradiance forecasting performance.
In previous research, several studies were conducted to predict solar irradiance on Jeju Island. Jung et al. [
24] proposed an attention mechanism-based LSTM network model for short-term solar irradiance prediction, utilizing historical solar irradiance and weather data from two different regions in Jeju Island. Park et al. [
25] developed a multistep-ahead solar irradiance forecasting model using the LightGBM, outperforming other EL and DL models. Moon et al. [
26] compared tree-based EL methods and found that the Cubist model provided better day-ahead hourly solar irradiance prediction than the others. However, because of the Cubist model’s longer training time, the RF model implemented in the ranger package [27] was considered a more suitable option, as it trains significantly faster in low-performance computing environments. Incorporating multiple hybrid models can enhance solar irradiance predictions, ultimately optimizing PV power generation and supporting smart island initiatives. Although improvements were observed with the single EL model, there remains room for increased accuracy, highlighting the importance of using multiple hybrid models in solar irradiance prediction [
28].
This research introduces a novel model, HYTREM (short for hybrid tree-based ensemble learning model), which improves the prediction of hourly solar irradiance on Jeju Island. HYTREM offers an innovative solution by combining the strengths of previous studies and contemporary trends in the field, effectively addressing the limitations of DL models that demand high-performance computing resources and single-tree-based EL models that may lack robustness. By generating separate learning datasets for each hour, HYTREM’s performance is significantly enhanced, leading to more accurate solar irradiance predictions. This innovative approach promotes interdisciplinary collaboration and drives progress in renewable energy management. Moreover, this research provides insights into potential advancements towards digital twin technology, significantly contributing to renewable energy forecasting. With HYTREM, the reliability of solar irradiance predictions can be improved, thereby increasing the practical applicability of renewable energy management strategies.
The main contributions of this study are as follows:
The novel model HYTREM is developed to accurately predict solar irradiance hourly on Jeju Island. Leveraging a variety of input variables and EL techniques, the model can effectively adapt to unique hourly patterns of solar irradiance.
Ranger-based RF models are utilized separately for each hour of the day, recognizing temporal variations in solar irradiance. This strategy accommodates the limited computational resources of energy providers and enhances the accuracy of predictions.
An online learning model incorporating time-series cross-validation (TSCV) with input variables is introduced. This approach allows the model to capture the latest solar irradiance trends and patterns for each specific hour, ensuring its relevancy to real-world conditions.
Transparent decision-making is emphasized through a variable importance analysis, visualizing crucial factors impacting the model’s prediction accuracy. This transparency fosters a data-driven approach to renewable energy management.
Although the present study primarily introduces a robust digital model for accurate solar irradiance prediction, it also lays the groundwork for future advancements toward digital twin technology. Despite current limitations, such as the lack of real-time communication and synchronization between the digital twin and physical twin, this digital model is a crucial step toward realizing a comprehensive digital twin in future research.
The structure of this paper unfolds as follows: Section 2 and Section 3 provide detailed descriptions of data preprocessing and the construction process of HYTREM. Section 4 presents the experimental results, validating HYTREM’s performance. Section 5 discusses HYTREM’s broader applicability across various geographical regions and its limitations and suggests potential future research directions. Finally, Section 6 concludes by summarizing findings and highlighting prospects in solar irradiance forecasting.
2. Data Preprocessing
In this study, the aim was to construct a solar irradiance forecasting model using date/time, meteorological data, and historical solar irradiance data provided by the Korea Meteorological Administration (KMA). The focus was on two regions, Ildo 1-dong (latitude: 33.51411 and longitude: 126.52969) and Gosan-ri (latitude: 33.29382 and longitude: 126.16283), located on Jeju Island, the largest island in South Korea. Jeju Island is actively implementing various measures to transition into a smart island by shifting from conventional fossil fuels to RESs. The data collection period spanned eight years, from 2011 to 2018, between 8 a.m. and 6 p.m. During this period, data on sky condition, temperature, humidity, wind speed, and solar irradiance were collected, along with other meteorological observation data such as soil temperature, total cloud volume, ground-surface temperature, and sunshine amount. However, the analysis was limited to sky condition, temperature, humidity, and wind speed, provided by KMA’s short-term weather forecasts [
29], as shown in
Figure 1. By considering the solar irradiance and weather conditions of these two regions, the aim was to assess the applicability of PV systems in South Korea, given Jeju Island’s commitment to energy independence through RESs.
In this study, the issue of missing data in the collected meteorological information was addressed. Approximately 0.1% of the total data for each category, including temperature, humidity, wind speed, and solar irradiance, had missing values indicated as −1. Because these variables are continuous, linear interpolation was employed to estimate the missing values. Logistic regression was utilized to approximate missing values for sky condition data, which were presented as categorical values ranging from 1 to 4. To effectively reflect the periodicity of the date and maintain consistency with the previous studies [25,26], day-ahead hourly solar irradiance forecasting was performed using the same independent variables, as shown in Table 2. The date information was converted to Julian dates, ranging from 1 to 365 for common years or 366 for leap years, where January 1 corresponded to 1 and December 31 to 365, or 366 in leap years.
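The interpolation step for the continuous variables can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `interpolate_missing`, the handling of gaps at the start or end of a series (copying the nearest valid value), and the sample humidity values are all assumptions introduced here.

```python
def interpolate_missing(values, sentinel=-1):
    """Fill entries equal to `sentinel` by linear interpolation.

    Gaps at the very start or end of the series have only one valid
    neighbour, so they are filled with that neighbour's value (an assumption).
    """
    filled = list(values)
    valid = [i for i, v in enumerate(values) if v != sentinel]
    if not valid:
        raise ValueError("no valid observations to interpolate from")
    for i, v in enumerate(values):
        if v != sentinel:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            # Weight by the position of the gap between its two valid neighbours.
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Hypothetical hourly humidity readings with two consecutive missing values.
humidity = [60.0, -1, -1, 75.0, 80.0]
print(interpolate_missing(humidity))
```

The two missing readings are placed on the straight line between 60.0 and 75.0, which matches the idea of exploiting the continuity of the series.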
Subsequently, one-dimensional (1D) date data were transformed into a two-dimensional (2D) continuous space using a sinusoidal transformation (Equations (1) and (2)), following the method detailed in [
30,
31]. The transformation used sine and cosine functions to map each day of the year onto a point in a unit circle, a standard practice when dealing with cyclic or periodic data, as it preserved the natural continuity between the endpoints of the cycle (i.e., 31 December and 1 January).
where Day_sin and Day_cos represent the coordinates of the point on the unit circle for the given Julian date.
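A minimal sketch of such a sinusoidal date encoding is given below. It assumes the common form sin(2π·d/N) and cos(2π·d/N), where d is the Julian date and N is 365 or 366; the exact expressions used in Equations (1) and (2) may differ, and the function name `encode_day_of_year` is hypothetical.

```python
import calendar
import math
from datetime import date


def encode_day_of_year(d):
    """Map a calendar date to (sin, cos) coordinates on the unit circle."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    julian = d.timetuple().tm_yday  # 1 for January 1, 365 or 366 for December 31
    angle = 2.0 * math.pi * julian / days_in_year
    return math.sin(angle), math.cos(angle)


dec_31 = encode_day_of_year(date(2018, 12, 31))
jan_01 = encode_day_of_year(date(2019, 1, 1))
jul_02 = encode_day_of_year(date(2018, 7, 2))

# Adjacent days across the year boundary stay close together in the 2D space,
# whereas their 1D Julian dates (365 and 1) are far apart.
print(math.dist(dec_31, jan_01))
# Midsummer lies on the opposite side of the circle from the year boundary.
print(jul_02)
```

Under this assumed form, the cosine component is high in winter and low in summer, and 31 December and 1 January map to neighbouring points on the circle.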
A significant improvement in the multiple correlation coefficients and the coefficient of determination, with solar irradiance as the dependent variable, was observed when the date data were transformed from 1D to 2D, as shown in
Table 3. In the 2D space, the multiple correlation coefficients for Ildo 1-dong and Gosan-ri increased from 0.061 to 0.424 and from 0.059 to 0.379, respectively. Similarly, the coefficients of determination also showed substantial improvement, increasing from 0.004 to 0.180 in Ildo 1-dong and from 0.003 to 0.144 in Gosan-ri.
The enhancement can be attributed to the 2D transformation, allowing for better capturing of the cyclical nature of time data. Time data are intrinsically circular (i.e., after 31 December comes 1 January), and a linear representation (i.e., Julian dates) may not effectively capture this cyclical characteristic [
32]. In contrast, a 2D representation, such as the sinusoidal transformation used here, maps time onto a circle, thereby maintaining the cyclical continuity of the data [
33,
34]. This enhanced representation of the data’s cyclical nature contributes to the higher correlation and determination coefficients observed in the 2D space, providing a more accurate and efficient representation of the periodicity in the collected meteorological data [
34].
The sky condition comprises four categories on an interval scale from one to four: clear, partly cloudy, mostly cloudy, and cloudy, with the cloud amount represented by eleven scales according to the climatology 1/10 method used by the KMA. To effectively represent these categorical data, one-hot encoding was employed. A value of 1 was assigned to the binary variable representing a specific sky condition, and 0 was assigned to the other conditions. Similarly, one-hot encoding was used to represent the hour factor on an interval scale from 8 (8 a.m.) to 18 (6 p.m.), taking into consideration that solar irradiance typically peaked between 12 p.m. and 2 p.m. [
25,
26]. To account for recent trends in solar irradiance, historical weather conditions from one day before the prediction time, including sky condition, temperature, humidity, wind speed, and solar irradiance, were incorporated as independent variables. This comprehensive representation of sky conditions, time intervals, and recent weather trends allowed for more accurate and adaptable predictions of intermittent solar irradiance in the model.
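The one-hot encoding of the sky condition and the hour can be illustrated with a short sketch. The constants and function names below are hypothetical, and the column ordering is an assumption made only for illustration.

```python
SKY_CONDITIONS = (1, 2, 3, 4)   # clear, partly cloudy, mostly cloudy, cloudy
HOURS = tuple(range(8, 19))     # 8 a.m. through 6 p.m.


def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"{value!r} is not one of {categories}")
    return [1 if value == category else 0 for category in categories]


def encode_sky_and_hour(sky_condition, hour):
    """Concatenate the one-hot vectors for sky condition and hour of day."""
    return one_hot(sky_condition, SKY_CONDITIONS) + one_hot(hour, HOURS)


# A mostly cloudy observation at 1 p.m.
print(encode_sky_and_hour(3, 13))
```

Each observation therefore contributes four binary sky-condition columns and eleven binary hour columns, with exactly one 1 in each group.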
3. Model Construction
In this section, the construction of HYTREM, a tree-based EL approach focused on day-ahead hourly solar irradiance forecasting in Jeju Island, is introduced. The overall architecture of the HYTREM procedure is presented in
Figure 2. The dataset was divided into a training set (in-sample) and a testing set (out-of-sample), covering the periods 2011–2016 and 2017–2018, respectively, with an approximate ratio of 75:25. Tree-based EL approaches are powerful ML techniques for regression and classification problems [
35,
36]. These approaches enable the development of high-accuracy prediction models with satisfactory performance and easily interpretable results. Unlike linear models, such as multiple linear regression (MLR), tree-based EL approaches can effectively capture nonlinear relationships between input and output variables and apply them to various ML applications in the energy sector [
37,
38].
For the HYTREM procedure, four tree-based EL approaches were utilized: XGBoost, LightGBM, CatBoost, and RF. Tree-based methods are well-suited for capturing the nonlinear characteristics of intermittently collected solar irradiance data, enabling them to address various problems in the ML domain [
39]. This study proposes an innovative approach by constructing a hybrid tree-based EL model that uses the output values of XGBoost, LightGBM, CatBoost, and RF, which are highly correlated with solar irradiance, as additional input variables instead of employing a stacking EL method that uses only each model’s forecasting values. This approach aims to accurately forecast hourly solar irradiance one day ahead. Each model’s functionality and advantages are briefly summarized below.
XGBoost [
40] is a gradient boosting algorithm suitable for various classification and regression problems. With the integration of several optimization techniques into the fundamental boosting algorithm, XGBoost attains high performance and efficiency. It sequentially adds new models to minimize prediction errors. This process is expressed mathematically in Equation (3) for a given objective function L, reflecting the algorithm’s emphasis on minimizing loss and thereby supporting high forecasting accuracy. XGBoost includes L1 and L2 regularization and offers an early stopping feature to prevent overfitting, improving the model’s generalization performance. With parallel processing capabilities utilizing multiple central processing unit (CPU) cores during tree construction, XGBoost achieves fast training speeds, making it well suited to large datasets [40,41]. It also features an automatic missing value handling mechanism, allowing model training without separate preprocessing steps [41].
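For orientation, the regularized objective commonly given in the XGBoost literature for boosting round t is reproduced below. This is the standard textbook form and is offered only as a reference point; it is not necessarily identical to the paper's Equation (3). Here l is a differentiable loss, f_t is the tree added at round t, T is the number of leaves, w is the vector of leaf weights, and γ, λ, and α are regularization strengths (the α term corresponds to the L1 penalty mentioned above).

```latex
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
```

The first term measures how well the ensemble fits the data after adding the new tree, and the second penalizes tree complexity, which is how the regularization limits overfitting.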
LightGBM [
42] is an efficient, high-performance gradient-boosting-based algorithm. It employs a leaf-wise tree construction method, enhancing efficiency and learning speed and making it suitable for large-scale datasets. LightGBM incorporates multiple optimization techniques based on gradient boosting to enhance forecasting performance, as expressed in Equation (4), which formulates the leaf-wise tree growth algorithm used. LightGBM also provides regularization, pruning, and early stopping features to prevent overfitting and supports GPU acceleration for faster training speeds. It accommodates various objective functions and evaluation metrics, enabling its application to classification, regression, and other problems using custom objective functions. LightGBM includes an automatic missing value handling feature, allowing model training without separate preprocessing steps.
CatBoost [
43] is a gradient-boosting-based algorithm demonstrating high performance in datasets with numerous categorical features and applies to various classification and regression problems. CatBoost employs a proprietary categorical feature handling method, eliminating the need for separate one-hot or label encoding steps and shortening data preprocessing time. It delivers high forecasting performance by utilizing multiple optimization techniques based on gradient boosting. Additionally, CatBoost supports various loss functions, enabling its application to classification and regression problems. CatBoost is also designed to be less sensitive to hyperparameter settings, making it more user-friendly and easier to fine-tune for optimal performance.
RF [
44] is an ensemble learning method applicable to various classification and regression problems. By generating multiple decision trees (DTs) and aggregating their forecasting results, RF attains higher accuracy than individual DTs. The mathematical representation of this process can be illustrated using Equation (5), where h(x, Θ_k, γ_k) symbolizes the kth tree’s decision function, and the RF combines these individual trees. Data and variables are randomly sampled during tree construction to create diverse tree structures, effectively mitigating the overfitting issue associated with DTs [44,45]. RF is known for its robust performance even without extensive hyperparameter tuning, as the application of several established heuristics (e.g., setting the number of features to p/3 for regression or sqrt(p) for classification, where p is the number of predictors [45]; or the number of trees to a sufficiently large value such as 128 or higher [46]) often yields satisfactory results. Additionally, RF can handle variables with missing values, reducing the data preprocessing burden. Due to its advantages of accuracy, simplicity, and flexibility, RF is widely used in various fields.
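As a small worked example of the feature-count heuristics mentioned above, the sketch below evaluates them for p = 34, the number of input variables used for the final model in this study. Rounding down to an integer and the lower bound of 1 are assumptions, and the function name `default_feature_count` is hypothetical.

```python
import math


def default_feature_count(n_predictors, task):
    """Heuristic number of candidate features per split for a random forest."""
    if n_predictors < 1:
        raise ValueError("n_predictors must be at least 1")
    if task == "regression":
        return max(n_predictors // 3, 1)          # p / 3, rounded down
    if task == "classification":
        return max(math.isqrt(n_predictors), 1)   # sqrt(p), rounded down
    raise ValueError("task must be 'regression' or 'classification'")


print(default_feature_count(34, "regression"))      # 11
print(default_feature_count(34, "classification"))  # 5
```

Because solar irradiance forecasting is a regression task, the p/3 heuristic would suggest about 11 candidate features per split for a 34-variable model.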
The selection of the RF model as the final model in this study stemmed from its compatibility with the specific requirements and constraints of the study rather than a universal claim of superiority. A key attribute distinguishing RF from boosting algorithms was its ability to learn evenly from all features, a characteristic enabled by controlling the number of features considered during model training [37,47]. This hyperparameter provided a more balanced learning approach. Another strength was RF’s strong performance even without extensive hyperparameter tuning, in contrast to gradient boosting machine (GBM)-based models [28,48], which often necessitate meticulous adjustment of hyperparameters for optimal performance. Additionally, RF models inherently mitigate the overfitting issues common with individual DTs [47,48]. While boosting algorithms can also handle overfitting, doing so often involves additional hyperparameter tuning and a nuanced understanding of the bias-variance trade-off. Thus, the selection of RF considered its balanced learning capability and alignment with the needs and constraints of the study.
Furthermore, whereas XGBoost, LightGBM, and CatBoost models are primarily implemented in Python, the RF model in this study requires online learning to apply time-series cross-validation, necessitating an alternative library. Python’s scikit-learn and R’s randomForest libraries offer excellent options for implementing RF; however, they present certain limitations, such as insufficient optimization for high-dimensional datasets and slow execution speeds. R’s ranger (RANdom Forest GEneRator) package [27], an efficient and rapid RF implementation, was employed to address these challenges in constructing the RF model. The ranger package significantly improves the speed and memory usage of the original RF algorithm, ensuring enhanced performance when processing large datasets [26,37].
This study employed XGBoost, LightGBM, CatBoost, and RF models to forecast day-ahead solar irradiance for the Jeju region. For the ranger model training, 34 input variables were used, comprising four prediction values from the XGBoost, LightGBM, CatBoost, and RF models and the existing 30 input variables known to generate fluctuations in the current trends and patterns for solar irradiance [28]. These prediction values were needed as input variables for the ranger model in both the training and testing sets. For the training set, they were generated using five-fold cross-validation. The XGBoost, LightGBM, CatBoost, and RF models were then constructed and trained in the conventional manner with the existing 30 input variables, and day-ahead hourly solar irradiance predictions were performed on the testing set. Finally, the training and testing sets were restructured using the prediction values as input variables, resulting in 34 variables.
Table 4 and Table 5 display Pearson correlation coefficients (PCCs) and p-values, reflecting the relationships between solar irradiance and each of the 34 independent (input) variables included in the study.
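The way base-model predictions become extra input variables can be sketched with out-of-fold predictions. The code below is a simplified illustration under several assumptions: folds are contiguous and unshuffled, the base learner is a toy mean predictor standing in for XGBoost, LightGBM, CatBoost, or RF, and all function names are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(X, y, fit, predict, k=5):
    """Predict every training row with a model that never saw that row."""
    predictions = [None] * len(X)
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            predictions[i] = predict(model, X[i])
    return predictions


def append_prediction_columns(X, prediction_columns):
    """Append one column per base model to the original feature rows."""
    return [row + [col[i] for col in prediction_columns] for i, row in enumerate(X)]


# Toy base learner: always predicts the mean of its training targets.
fit_mean = lambda X_train, y_train: sum(y_train) / len(y_train)
predict_mean = lambda model, x: model

X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
oof = out_of_fold_predictions(X, y, fit_mean, predict_mean, k=5)
X_aug = append_prediction_columns(X, [oof, oof, oof, oof])  # four base models
print(len(X[0]), "->", len(X_aug[0]))
```

With 30 original variables and four base models, the same construction yields the 34-variable training set described above. For the testing set, the base models fitted on the whole training set supply the four extra columns.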
In Table 4 and Table 5, the PCCs and p-values quantify the degree and significance of the correlation between each of the 34 variables and solar irradiance, measuring their linear association. The correlation coefficients vary between −1 and +1. A positive value indicates a direct relationship, whereas a negative value indicates an inverse relationship. For instance, in the Ildo 1-dong dataset (Table 4), Temp has a positive correlation (0.307), suggesting that solar irradiance also tends to increase as temperature increases. On the other hand, Day_cos has a negative correlation (−0.417), with low values in summer and high values in winter, indicating an inverse relationship with solar irradiance. The accompanying p-values indicate the statistical significance of these correlations. A lower p-value suggests a more statistically significant relationship. For instance, Temp in the Ildo 1-dong dataset (Table 4) shows a p-value of less than 0.001, implying a statistically significant relationship with solar irradiance. This correlation analysis uncovers potential linear relationships between each input variable and the output variable, solar irradiance, in the given datasets. It treats each input variable independently within the model while acknowledging possible real-world interdependencies.
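For readers who want to reproduce this kind of analysis, the Pearson correlation coefficient can be computed as in the sketch below. The function name `pearson` and the toy temperature and irradiance values are illustrative assumptions and are not data from this study; the p-value computation is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Toy values: irradiance tends to rise with temperature in this example.
temperature = [5.0, 12.0, 18.0, 25.0, 30.0]
irradiance = [0.6, 1.1, 1.0, 1.9, 2.2]
print(round(pearson(temperature, irradiance), 3))
```

A value near +1 or −1 indicates a strong linear association, whereas a value near 0 indicates little linear association, even if a nonlinear relationship exists.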
TSCV was applied to the ranger model for implementing an online learning approach, addressing data shortage issues, reflecting recent solar irradiance patterns, and adjusting the weights of input variables [
49,
50]. TSCV concentrates on multiple forecast horizons for each testing set, as shown in
Figure 3. Different training sets were used, depending on the scheduled period, including one observation not used in the initial training set. To effectively combine the strengths of XGBoost, LightGBM, CatBoost, and RF models in the final ranger model trained with TSCV, the prediction values from these models were incorporated as input variables. To capture the temporally distinct characteristics of solar irradiance, the data were divided by the hour of the day, and the input variables for the model were adjusted accordingly. During this process, hour-related variables such as T8 (8 a.m.) to T18 (6 p.m.) were removed to focus better on unique features and patterns of specific hours.
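The TSCV scheme can be sketched as an expanding-window split generator. This is a simplified illustration that assumes the training window always starts at the first observation and grows by the forecast horizon at each step; the function name and parameters are hypothetical, and the study itself applied TSCV through the ranger model in R.

```python
def expanding_window_splits(n_samples, initial_train_size, horizon=1):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Each split trains on all observations seen so far and tests on the next
    `horizon` observations, which then join the following training set.
    """
    if not 0 < initial_train_size < n_samples:
        raise ValueError("initial_train_size must be between 1 and n_samples - 1")
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    start = initial_train_size
    while start < n_samples:
        end = min(start + horizon, n_samples)
        yield list(range(start)), list(range(start, end))
        start = end


for train_idx, test_idx in expanding_window_splits(6, initial_train_size=3):
    print(train_idx, "->", test_idx)
```

Because no test observation ever precedes its training data, the scheme avoids look-ahead leakage while letting each refit absorb the most recent observations.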
In response to the divided dataset, 11 separate RF models were constructed using the ranger package and TSCV, each tailored to a specific hour of the day. This approach allowed for a better account of the unique characteristics of solar irradiance at each hour, resulting in more accurate predictions of fluctuations and patterns associated with different times of the day. By customizing the models to accommodate these hourly variations, the predictions’ overall forecasting accuracy and reliability can be improved [
34]. Furthermore, variable importance values were extracted for each model to ensure the approach’s credibility, providing insights into the factors driving solar irradiance patterns at different hours and periods. This process series is called the hybrid-tree-based ensemble model (HYTREM). By leveraging the hour-dependency dataset and the combined strengths of the XGBoost, LightGBM, CatBoost, and RF models, accurate solar irradiance predictions were efficiently and effectively achieved through HYTREM.
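The hour-wise partitioning behind the 11 models can be illustrated as follows. The record layout, the field names `hour` and `irradiance`, and the toy mean-based learner are assumptions made for this sketch; in the study, each hourly subset was fitted with a ranger-based RF model.

```python
HOURS = tuple(range(8, 19))  # 8 a.m. through 6 p.m., i.e., 11 hourly slots


def split_by_hour(rows):
    """Group records by hour and drop the hour field from each record."""
    by_hour = {hour: [] for hour in HOURS}
    for row in rows:
        hour = row["hour"]
        if hour not in by_hour:
            raise ValueError(f"hour {hour} is outside the 8-18 range")
        # The hour is constant within a subset, so it carries no information there.
        by_hour[hour].append({k: v for k, v in row.items() if k != "hour"})
    return by_hour


def train_hourly_models(rows, fit):
    """Fit one model per hour, skipping hours that have no records."""
    return {hour: fit(subset) for hour, subset in split_by_hour(rows).items() if subset}


def fit_mean(subset):
    """Toy learner: the model is simply the mean irradiance of its subset."""
    return sum(record["irradiance"] for record in subset) / len(subset)


rows = [
    {"hour": 8, "temp": 12.0, "irradiance": 0.4},
    {"hour": 8, "temp": 14.0, "irradiance": 0.6},
    {"hour": 13, "temp": 20.0, "irradiance": 2.5},
]
models = train_hourly_models(rows, fit_mean)
print(sorted(models))
```

At prediction time, a record for a given hour would be routed to the model stored under that hour's key.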
5. Discussion
Further investigation has been conducted to demonstrate the robustness and wide applicability of the proposed HYTREM beyond the context of Jeju Island, extending to diverse geographical regions. A previous study [
23] investigated the solar irradiance in Seoul, Busan, and Incheon, three metropolitan cities that actively utilize renewable energy resources, and highlighted the impressive integration of infrastructure resources in these cities, which has propelled the successful implementation of smart city initiatives. From 2011 to 2020, solar irradiance data were consistently gathered every hour from 8 a.m. to 6 p.m. The same 34 input variables in this research were assembled for day-ahead hourly solar irradiance forecasting. The data from 2011 to 2018 and 2019 to 2020 were utilized as this experiment’s training and testing sets, respectively.
As evidenced in
Table 10, when the HYTREM’s performance was compared to that of Att-LSTM, Att-Bi-LSTM, and Att-GRU models, it displayed superior outcomes across all cities. More specifically, for Seoul, HYTREM reported an MAE, RMSE, and NRMSE of 0.129, 0.193, and 15.535, respectively. These scores were significantly lower than those of the Att-LSTM, Att-Bi-LSTM, and Att-GRU models, demonstrating HYTREM’s robust predictive power. Similar trends were observed in Busan and Incheon, further reinforcing HYTREM’s superior performance in forecasting solar irradiance. In conclusion, these findings highlighted the broad applicability and versatility of HYTREM, reinforcing its potential to generate accurate solar irradiance predictions in diverse geographical and climatic conditions. Nevertheless, it was advised that the model be fine-tuned and validated with local data for each specific application to ensure optimal performance.
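The three error measures reported above can be computed as in the sketch below. The MAE and RMSE definitions are standard. The NRMSE normalization is an assumption here: RMSE divided by the mean of the observed values, expressed as a percentage. The paper's own definition may differ, and the sample values are illustrative only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def nrmse_percent(y_true, y_pred):
    """RMSE normalized by the mean observation, in percent (assumed definition)."""
    mean_observed = sum(y_true) / len(y_true)
    if mean_observed == 0:
        raise ValueError("mean of observations is zero; NRMSE is undefined")
    return 100.0 * rmse(y_true, y_pred) / mean_observed


# Illustrative observed and predicted hourly irradiance values.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(mae(observed, predicted), rmse(observed, predicted), nrmse_percent(observed, predicted))
```

Lower values indicate better forecasts for all three measures. The normalized measure is convenient when comparing sites whose typical irradiance levels differ.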
Despite its demonstrated capabilities, transitioning HYTREM from its current state as a digital model to the desired digital twin technology will require addressing certain limitations and outlining areas for improvement.
The dialogue between scientific advances and their limitations naturally guides the evolution of new ideas and avenues for future research. While HYTREM shows commendable predictive accuracy in forecasting solar irradiance across South Korean regions such as Jeju Island, Seoul, Busan, and Incheon, it has certain limitations. These include the need for further refinement to accommodate diverse global conditions such as the Sahara desert’s intense heat, the Andes’ high altitudes, or the cool climates of Scandinavia.
In its current form, the model represents a digital model rather than the aspirational digital twin technology. Its enhancement, in this regard, would necessitate the integration of key aspects of digital twin technology, including synchronization in terms of time and data and the reciprocal communication between the digital twin and its physical counterpart.
Additionally, the architecture encompassing the digital and physical twins must be included. This highlights an area ripe for growth and expansion in the model’s scope.
Although HYTREM exhibits significant promise, the accuracy and reliability of its forecasts are largely contingent on the quality and diversity of the available data. Hence, while potentially broad, its applicability to various geographical and climatic conditions requires additional validation against a more extensive dataset.
Considering these identified limitations, various promising avenues have been envisioned for future research and development.
Future research should enhance the model by integrating the crucial elements of digital twin technology, including synchronization and communication aspects, and establishing a comprehensive architecture to house both the digital twin and its physical twin.
There is a need to expand the dataset used for training the model. This would facilitate a more comprehensive validation of the model’s effectiveness and improve its accuracy across diverse geographical regions and climatic conditions.
As a long-term ambition, the evolution of the current digital model into a fully-fledged digital twin technology is proposed. The current study serves as a foundational step towards developing such advanced models, setting the course for digital twin technology as a significant future direction for this line of research.
Lastly, considering Jeju Island’s commitment to evolving into a digital twin-based smart island, a promising direction for future research is applying the model to support the island’s sustainable development initiatives. This could involve leveraging the model to optimize renewable energy generation, encourage eco-friendly transportation, and improve intelligent energy management systems on the island, thus potentially contributing to the island’s pioneering efforts in sustainable development and digital innovation.
6. Conclusions
The research presented in this paper developed a hybrid ensemble model, named HYTREM, for robust solar irradiance forecasting, primarily for Jeju Island. The objective was to enhance the efficiency of solar photovoltaic power systems, given the island’s commitment to becoming a digital twin-based smart island. Although currently limited in its digital twin features, the model was envisioned as a versatile tool and demonstrated potential for accurate solar irradiance prediction using diverse weather observation data. HYTREM used various input variables that had a strong relationship with solar irradiance, including timestamp, weather, and historical solar irradiance variables, as well as the predicted values of tree-based EL models, namely RF, XGBoost, LightGBM, and CatBoost. Because solar irradiance patterns differed for each hour, separate ranger-based RF models were built for each hour of the day, enabling more accurate predictions while catering to energy providers with limited computational resources. These models were trained online on the input variables through TSCV so that they effectively reflected recent solar irradiance trends and patterns for each hour of the day.
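The hour-wise training scheme summarized above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this study: the TSCV is assumed to use an expanding window, a mean predictor stands in for the ranger-based RF, and all function names and the toy data are hypothetical.

```python
from collections import defaultdict


def group_by_hour(records):
    """Group time-ordered (day, hour, value) records into one series per hour."""
    by_hour = defaultdict(list)
    for day, hour, value in records:
        by_hour[hour].append((day, value))
    return by_hour


def expanding_window_splits(n_samples, initial_train, step=1):
    """Yield (train, test) index lists; the training window grows and never sees the future."""
    end = initial_train
    while end < n_samples:
        stop = min(end + step, n_samples)
        yield list(range(end)), list(range(end, stop))
        end = stop


def walk_forward_mae(series, initial_train, step=1):
    """Refit on each fold and score the next block (placeholder learner: training mean)."""
    errors = []
    for train_idx, test_idx in expanding_window_splits(len(series), initial_train, step):
        prediction = sum(series[i][1] for i in train_idx) / len(train_idx)
        errors.extend(abs(series[i][1] - prediction) for i in test_idx)
    return sum(errors) / len(errors)


# Toy data: 6 days, 3 hours per day; values are illustrative only.
records = [(day, hour, 0.1 * hour + 0.01 * day) for day in range(6) for hour in (8, 9, 10)]
per_hour_mae = {
    hour: walk_forward_mae(series, initial_train=3)
    for hour, series in sorted(group_by_hour(records).items())
}
```

Fitting one small model per hour keeps each training set compact, and refitting on a growing window lets each model absorb the most recent observations before predicting the next block.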
As presented in
Table 6, the experimental results demonstrated that the proposed HYTREM outperformed the state-of-the-art forecasting models regarding MAE, RMSE, and NRMSE values across both regions. Compared to the baseline models, HYTREM achieved significant improvements in prediction accuracy. Specifically, in the Ildo 1-dong testing set, HYTREM reduced the MAE by approximately 74.3% compared to Att-LSTM w/dropout, 64.8% compared to Att-LSTM w/o dropout, 56.4% compared to Att-Bi-LSTM, 59.2% compared to Att-GRU, 61.6% compared to Cubist, 61.1% compared to LightGBM, and 74.3% compared to the Persistence model. Similarly, in the Gosan-ri testing set, HYTREM achieved reductions of approximately 64.1%, 62.6%, 47.7%, 46.4%, 61.4%, 44.2%, and 80.1% in MAE compared to the respective baseline models mentioned earlier. These results confirmed the superior performance of HYTREM in day-ahead solar irradiance prediction and established it as the most accurate model among the compared state-of-the-art models. The exceptional performance of HYTREM showcased its potential applicability and effectiveness in real-world solar energy forecasting. This comprehensive comparison provided robust evidence supporting the capabilities of the proposed model and its significant enhancements in solar irradiance predictions, ensuring reliable decision-making for renewable energy management.
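The percentage reductions quoted above are relative improvements in MAE over each baseline. A minimal sketch of that arithmetic follows; the function name and the example values are hypothetical and are not taken from Table 6.

```python
def mae_reduction_percent(baseline_mae, model_mae):
    """Relative MAE reduction of a model over a baseline, in percent."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Hypothetical values: a baseline MAE of 0.5 and a model MAE of 0.2.
example_reduction = mae_reduction_percent(0.5, 0.2)
```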
The research is significant in fostering interdisciplinary innovation and enhancing real-world applicability in renewable energy management. Future work could explore other ensemble methods to improve the model’s performance further and investigate its robustness under extreme weather conditions, which would provide insights into its resilience to unexpected changes in weather patterns. In addition, assessing the model’s impact on the overall power system will help evaluate its effectiveness in enhancing the efficiency of solar power generation systems and inform decisions about its wider adoption. This could promote the widespread adoption of solar energy and sustainable development on Jeju Island and in other regions.
However, the conclusions drawn from this study are based mainly on data from Jeju Island. While the model can potentially be applied to other regions, the results may vary due to differences in geographical and climatic conditions. Future research should therefore validate the model with diverse datasets to ensure its robustness and applicability in different geographical contexts, such as plains, plateaus, hills, or mountains. It should also integrate digital twin technology features into the model, enhancing synchronization and interaction between the digital twin and its physical counterpart.