Efficient Machine Learning-Based Prediction of Solar Irradiance Using Multi-Site Data
Abstract
1. Introduction
Method | Description | Advantages | Disadvantages |
---|---|---|---|
Statistical Methods [6] | Utilize historical weather data to identify patterns and predict future GHI. Includes methods like regression analysis, ARIMA, and time-series analysis. | Simple, quick, and often effective for short-term forecasts. | May not capture complex weather patterns; less effective for long-term forecasts. |
Physical Models [7] | Rely on physical principles and equations that govern the atmosphere and solar radiation. These models include clear-sky models and radiation transfer models. | More accurate as they are based on physical laws; suitable for clear-sky conditions. | Require detailed atmospheric data; computationally intensive. |
Hybrid Models [8] | Combine statistical and physical models to leverage the strengths of both approaches for more accurate predictions. | Can provide better accuracy by integrating multiple data sources and methods. | Complex to implement and may require substantial computational resources. |
Machine Learning Techniques [7] | Use algorithms like neural networks, support vector machines, and random forests to learn from large datasets and make predictions. | Capable of handling large datasets and capturing complex relationships. | Require large amounts of data and computational power; can be a ‘black box’. |
Satellite-Based Models [9] | Employ satellite imagery to estimate GHI by analyzing the amount of cloud cover and other atmospheric conditions. | Provide spatially comprehensive data and can cover large areas. | Dependent on satellite data availability and quality; may suffer from time delays. |
Numerical Weather Prediction (NWP) Models [10] | Use complex mathematical models to simulate the atmosphere and predict GHI based on weather forecasts and other meteorological data. | Highly accurate and can provide detailed forecasts for various time scales. | Computationally expensive; require high-performance computing resources. |
1.1. Problem Formulation
- Photovoltaic cells:
- Operation: PV cells, commonly arranged into solar panels, directly convert sunlight into electricity.
- Usage: This method is widely used in residential, commercial, and utility-scale solar energy systems.
- Concentrated solar power (CSP) plants:
- Operation: CSP plants use mirrors or lenses to concentrate sunlight onto a small area, generating heat. This heat is then used to produce electricity.
- Scale: Typically, large-scale operations requiring substantial infrastructure and investment.
- Advantage: CSP can provide a consistent and reliable renewable energy source.
- Direct normal irradiance (DNI): Solar radiation received directly from the sun's disk, without scattering by the atmosphere.
- Diffuse horizontal irradiance (DHI): Indirect radiation scattered by atmospheric particles, clouds, or other meteorological elements.
- Albedo: The ratio of the reflected to incident radiation flux.
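These quantities are tied together by the standard decomposition of global horizontal irradiance, stated here for reference (a textbook relation, not given explicitly in the text), where $\theta_z$ denotes the solar zenith angle:

```latex
\mathrm{GHI} = \mathrm{DNI}\cdot\cos(\theta_z) + \mathrm{DHI}
```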
Reference | Contribution | Models Used | Result |
---|---|---|---|
[17] | Develop a real-time solar irradiance forecasting technique | Cloud tracking techniques, implanted fisheye network camera, ANN algorithm | The techniques outperformed existing forecasting models at two locations. |
[18] | Review the use of ANNs in solar power generation forecasting and analyze measurement instruments | ANN | ANN usage increases accuracy; hybrid systems and instrument calibration improve accuracy. |
[19] | Forecast hourly GHI using ensemble model | Extreme gradient boosting forest, deep neural network, ridge regression | The proposed model outperformed standalone ML and DL models in stability and accuracy. |
[20] | Compare ML and DL methods for predicting solar irradiance | SVR, Random Forest, polynomial regression, ANN, CNN, RNN | DL algorithms were more accurate but required more computational power than ML algorithms. |
[21] | Predict daily solar irradiance | Feed Forward Neural Nets (FFNNs), empirical models, Holt–Winters, RSM | ANN model outperformed Holt–Winters, RSM, and empirical models. |
[22] | Model solar radiation using robust soft computing method | Least square support vector machine, multi-verse optimizer algorithm, genetic algorithm, gray wolf optimization, sine cosine algorithms | The proposed technique outperformed others in terms of accuracy. |
[23] | Develop hourly day-ahead solar irradiance forecasting model | LSTM+RNN, FFNN | The proposed approach outperformed FFNN; simulation showed 2% increase in annual energy savings. |
[24] | Predict solar irradiance using hybrid models and probabilistic forecasts | PSO-XGBoost, PSO-LSTM, PSO-GBRT, ANN, CNN, LSTM, RF, GBRT, XGBoost | PSO-LSTM showed superiority for day-ahead solar prediction. |
Current | Predict GHI for the next temporal step at six stations at once, with different characteristics and different lagged versions | Tree-based algorithms, MLP, ensemble learners | Tested on sites with different feature sets; analyzed the impact of time-shifted input data on GHI prediction. |
1.2. Motivation, Contributions, and Organization
2. Proposed Solution
2.1. Dataset Presentation and Acquisition
- Station identifier: Each station is uniquely identified by an 8-character code.
- Date range: Users can specify a begin and end date for data extraction in the YYYYMMDD format. If no dates are specified, the most recent day’s data is returned by default.
- Comprehensive headers: Each dataset includes a header describing the columns, which may change over time. Users should read the header for proper usage.
- Instrument configuration changes: If the specified date range spans an instrument configuration change, the API returns an error, indicating the date of the change. Users need to make separate queries for data before and after the change.
- Field API: Before querying the Data API, users can call the Field API to learn which measurement parameters are available for a station over time.
- Station List API: Provides a comma-separated list of station identifiers, names, latitude, longitude, and elevation.
- AIM Meta Data API: Offers information on instrument calibration history, location, maintenance, etc.
- Station ID and name.
- Date range between January 2020 and December 2022.
- All available environmental and irradiance parameters (e.g., GHI, DHI, DNI, temperature, pressure).
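As an illustration of the extraction described above, the following sketch pulls one station's raw data into a pandas DataFrame. The endpoint path and the `site`/`begin`/`end` parameter names are assumptions to be verified against the cited MIDC Raw Data API documentation:

```python
import io

import pandas as pd
import requests

# Assumed raw-data endpoint; see the MIDC Raw Data API documentation cited above.
MIDC_URL = "https://midcdmz.nrel.gov/apps/data_api.pl"

def fetch_midc(station_id: str, begin: str, end: str) -> pd.DataFrame:
    """Download raw CSV data for `station_id` between `begin` and `end` (YYYYMMDD)."""
    params = {"site": station_id, "begin": begin, "end": end}
    resp = requests.get(MIDC_URL, params=params, timeout=60)
    resp.raise_for_status()
    # The first line is the header describing the columns; it may change over
    # time, so it is parsed rather than hard-coded.
    return pd.read_csv(io.StringIO(resp.text))

# Example: three years of ULL data, queried year by year so that a query never
# spans an instrument-configuration change (the API errors across such changes).
frames = [fetch_midc("ULL", f"{y}0101", f"{y}1231") for y in (2020, 2021, 2022)]
ull_raw = pd.concat(frames, ignore_index=True)
```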
2.2. Proposed Framework
- Data collection from multiple PV sites: Solar irradiance and environmental data were collected for six PV stations over three years using the MIDC Raw Data API. For each site, the raw files were parsed into structured DataFrames with consistent timestamp formatting (30 min resolution).
- Data preprocessing pipeline: The preprocessing consisted of the following clearly defined steps:
  - Synchronization and alignment: All station data were synchronized on UTC timestamps; mismatched timestamps and duplicated entries were dropped. Feature names were standardized across stations (e.g., “Global Horizontal” → “GHI”) to ensure schema consistency, and units were converted where necessary to a consistent W/m2 scale.
  - Concatenation: After alignment, the six station datasets were concatenated into a unified dataset. Each row retained its station identifier, enabling station-specific modeling and evaluation while allowing for global analysis.
  - Data cleaning:
    - Features with more than 30% missing values were dropped.
    - Remaining missing values were imputed using forward-fill interpolation.
    - Outliers were identified per feature using the IQR method and removed.
  - Resampling and time lagging: Each DataFrame was resampled to a fixed interval (30 min in our setting). A windowing function was then applied to generate time-lagged features: lagged versions of GHI, DHI, DNI, and other relevant features were created for T−1, T−2, …, T−n intervals (e.g., the past 30, 60, and 90 min) to support prediction at T+1 and beyond (see the preprocessing sketch after this list).
- Feature selection: A multi-step feature selection strategy was adopted:
  - Kendall filtering: Kendall correlation coefficients were calculated between all feature pairs, excluding the target variable. To mitigate multicollinearity, one feature from each highly correlated pair (coefficient > 0.95) was removed. Kendall coefficients were also computed between each feature and the target variable (GHI), and features with a coefficient below 0.3 were discarded, retaining the most relevant and informative predictors (see the selection sketch after this list).
  - Tree-based importance: Feature importance scores were computed using an ensemble of tree-based models, averaged across models, and the top 15 most influential features were retained for each station.
  - Incremental feature selection: To further refine the input space and evaluate the contribution of individual features, models were iteratively trained starting from the most important feature and adding one feature at a time. At each step, performance was evaluated using validation metrics (e.g., RMSE and R2), identifying the feature subset that maximized predictive performance while minimizing the risk of redundancy and overfitting.
- Model training and evaluation strategy:
  - Training pipeline: The predictive task is formulated as a regression problem, and the following steps were performed for each station independently (a training sketch follows this list):
    - Train/test split: Each station’s cleaned, feature-selected dataset was ordered chronologically and split into the first 80% of observations (by timestamp) for training and the remaining 20% for testing. This single-split strategy avoids shuffling or interleaving future observations into the training set, preventing data leakage and preserving temporal integrity.
    - Model initialization: Each model was initialized with a fixed random seed (42) to ensure reproducibility.
    - Model set: Multiple regressors were trained with fixed seeds and default hyperparameters for uniform comparison:
      - MLPRegressor: the best configuration used 3 hidden layers of 100 neurons each, ReLU activation, the Adam optimizer, and up to 10,000 iterations.
      - GradientBoosting, RandomForest, ExtraTrees, CatBoost, XGBoost, and LightGBM: all tree-based regressors were initialized with their library defaults, with hyperparameter tuning skipped for uniform comparison.
    - Model evaluation: Each model was evaluated using five metrics: R2, RMSE, MAE, median absolute error, and MAPE.
- Ensemble modeling: In addition to the standalone models, a stacking-based ensemble was built for each station to improve generalization, implemented with Scikit-learn’s StackingRegressor and 5-fold internal cross-validation (see the sketches below).
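A minimal pandas sketch of the cleaning, resampling, and windowing steps described above, assuming a numeric DataFrame indexed by UTC timestamps; the column name `GHI` and the three-lag grid follow the text:

```python
import pandas as pd

def preprocess(df: pd.DataFrame, n_lags: int = 3) -> pd.DataFrame:
    """Clean one station's DataFrame and add 30 min lag features."""
    # Drop features with more than 30% missing values.
    df = df.loc[:, df.isna().mean() <= 0.30]
    # Impute remaining gaps with forward fill.
    df = df.ffill()
    # Mask per-feature IQR outliers as NaN, then re-impute.
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    df = df.where((df >= q1 - 1.5 * iqr) & (df <= q3 + 1.5 * iqr)).ffill()
    # Resample to 30 min and build lagged copies of every feature.
    df = df.resample("30min").mean()
    lagged = {f"{col}_t{k}_30min": df[col].shift(k)
              for col in df.columns for k in range(1, n_lags + 1)}
    out = df.assign(**lagged)
    # Target: GHI at the next 30 min step (T+1).
    out["GHI_target"] = out["GHI"].shift(-1)
    return out.dropna()
```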
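The two Kendall filters can be sketched as follows; the thresholds (0.95 for collinearity, 0.3 for relevance) are those stated above, while the greedy pair-pruning order is an implementation choice not specified in the text:

```python
import pandas as pd

def kendall_select(X: pd.DataFrame, y: pd.Series) -> list[str]:
    """Drop one feature per highly collinear pair, then keep target-relevant ones."""
    corr = X.corr(method="kendall").abs()
    keep = list(X.columns)
    for i, a in enumerate(X.columns):
        for b in X.columns[i + 1:]:
            # Remove the second feature of any pair with |tau| > 0.95.
            if a in keep and b in keep and corr.loc[a, b] > 0.95:
                keep.remove(b)
    # Discard features whose Kendall correlation with the target GHI is below 0.3.
    return [c for c in keep if abs(X[c].corr(y, method="kendall")) >= 0.3]
```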
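A sketch of the per-station training protocol, assuming scikit-learn ≥ 1.4 for `root_mean_squared_error`; the stacked pair shown (GradientBoosting + MLP) is one example composition from the results tables, not the fixed recipe for every station:

```python
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.metrics import (mean_absolute_error, mean_absolute_percentage_error,
                             median_absolute_error, r2_score,
                             root_mean_squared_error)
from sklearn.neural_network import MLPRegressor

def train_station(df, target="GHI_target", seed=42):
    """Chronological 80/20 split, stacked ensemble, and the five metrics."""
    X, y = df.drop(columns=[target]), df[target]
    split = int(len(df) * 0.8)                     # no shuffling: temporal integrity
    X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]
    model = StackingRegressor(
        estimators=[("gb", GradientBoostingRegressor(random_state=seed)),
                    ("mlp", MLPRegressor(hidden_layer_sizes=(100, 100, 100),
                                         max_iter=10_000, random_state=seed))],
        cv=5)                                      # 5-fold internal cross-validation
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {"R2": r2_score(y_te, pred),
            "RMSE": root_mean_squared_error(y_te, pred),
            "MAE": mean_absolute_error(y_te, pred),
            "MedianAE": median_absolute_error(y_te, pred),
            "MAPE": mean_absolute_percentage_error(y_te, pred)}
```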
2.3. Description of ML Models and Performance Metrics
- MLP: A Multi-Layer Perceptron regressor with a neural network architecture consisting of three hidden layers, each containing 100 neurons. The model is set to run for a maximum of 10,000 iterations.
- DecisionTreeRegressor: A decision tree regressor that predicts a target value using a tree-like model of decisions.
- RandomForest: An ensemble learning method that constructs a multitude of decision trees during training and outputs the mean prediction of the individual trees.
- ExtraTrees: An ensemble learning technique similar to the RandomForest, but it uses the entire dataset instead of a bootstrap sample and selects split points at random.
- GradientBoosting: An ensemble technique that builds trees sequentially, each one correcting errors made by the previous ones.
- CatBoost: A gradient boosting regressor that handles categorical features natively and is designed to be fast and accurate. The model has its verbosity set to 0 to suppress output during training.
- LightGBM: A gradient boosting framework that uses tree-based learning algorithms. It is designed to be efficient and fast, especially with large datasets.
- XGBoost: An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.
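For reference, a sketch instantiating the model set of this section with the stated settings; all seeds are fixed at 42, CatBoost, LightGBM, and XGBoost come from their own packages, and everything else from scikit-learn:

```python
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

SEED = 42
models = {
    # Three hidden layers of 100 neurons each, up to 10,000 iterations.
    "MLP": MLPRegressor(hidden_layer_sizes=(100, 100, 100),
                        max_iter=10_000, random_state=SEED),
    "DecisionTree": DecisionTreeRegressor(random_state=SEED),
    "RandomForest": RandomForestRegressor(random_state=SEED),
    "ExtraTrees": ExtraTreesRegressor(random_state=SEED),
    "GradientBoosting": GradientBoostingRegressor(random_state=SEED),
    # verbose=0 suppresses CatBoost's per-iteration training output.
    "CatBoost": CatBoostRegressor(verbose=0, random_state=SEED),
    "LightGBM": LGBMRegressor(random_state=SEED),
    "XGBoost": XGBRegressor(random_state=SEED),
}
```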
2.4. Exploratory Data Analysis
- What does the outlier study reveal? Outlier analysis, an essential step for detecting factors such as fluctuations in incident radiation, solar panel performance, and faults, was performed at the six sites. The presence of outliers can be attributed to various factors beyond those initially mentioned. Notably, outliers were observed in the GHI series across all stations, as shown in Figure 3. At the NELHA site, outliers were found in the global PAR, global UV-PFD, and global UV measurements. At the HSU site, outliers were present in the diffuse irradiance variants, direct normal irradiance, and, to some extent, the airmass feature. At the ULL station, outliers were detected in the GHI, DNI, DHI, and CR1000 temperature measurements. Outliers in solar irradiance data often result from rapid and unpredictable changes in environmental conditions, sensor inaccuracies, and calibration issues. For instance, GHI outliers across all stations may stem from sudden weather changes, such as cloud cover or precipitation, which affect the total solar radiation measured. At the NELHA site, outliers in global PAR, UV-PFD, and UV measurements could be due to local variations in vegetation or atmospheric conditions. At the HSU site, the presence of outliers in diffuse irradiance, direct normal irradiance, and the airmass feature suggests sensitivity to atmospheric variability and measurement inconsistencies. The ULL station exhibited outliers in GHI, DNI, DHI, and CR1000 temperature readings, likely caused by sudden temperature fluctuations or sensor exposure to environmental factors. Understanding these causes is crucial for improving data accuracy and the reliability of solar irradiance forecasts. The UNLV station, in contrast, exhibited outliers in only two series, GHI and wind chill temperature, suggesting that outliers in these series may affect the model’s performance when time-shifted values are used. Similarly, the RAZON station had outliers in only two series, GHI and DNI, whereas the NWTC station was characterized by the absence of outliers.
- What does the correlation study reveal? Correlation analysis was conducted in parallel across all station features using the Kendall method, chosen for its robustness against outliers, as shown in Figure 4. This analysis highlighted variability in feature correlations across sites. For example, the features most correlated with the GHI series were global PAR, CR1000 temperature, solar PV temperature, and average wind direction, with all but wind direction showing correlations exceeding 50%. At the ULL station, the GHI series was highly correlated with DNI and DHI, with the CR1000 temperature and solar panel temperature also notable, although the solar panel temperature correlation was only 50%. At the HSU station, the CR510 battery and temperature showed the strongest correlations with the GHI. From these observations, it can be inferred that the CR1000 temperature, the components of the GHI, and the solar panel temperature are key, positively correlated predictors of GHI across stations. Conversely, negative correlations were noted for relative humidity, CR1000 battery, and zenith angle, suggesting these factors may inversely affect GHI predictions. The NELHA station is characterized by a high correlation between global PAR and GHI, as well as between the CR1000 and temperature measurements. The UNLV station exhibits the strongest feature correlations; its GHI values correlate with global UVA, UVB, and the components of GHI, namely DNI and DHI, and the airmass also correlates well with the target. For the RAZON station, the components of GHI, the PR1 and PH1 temperatures, pressure, and the RAZON status together with the azimuth angle all correlate with the target above 50%. It is also important to note the presence of negative correlations: depending on the station, the zenith angle can correlate positively or negatively with the target, and the same holds for the azimuth angle. Furthermore, features with a correlation below 0.3 can be dropped, a threshold applied particularly to stations exhibiting strong (above 50%) correlations with the target GHI.
- How is GHI linked to the time-lagged versions? In this step, we focused on the lagged versions of the GHI across the three stations ULL, NELHA, and HSU. We performed autocorrelation and partial autocorrelation analyses using the ACF and PACF functions over a 30 min horizon. This involved resampling the data to 30 min intervals and examining how the GHI values relate to past values at intervals of 30 min, 1 h, and 1.5 h (a minimal sketch of this analysis follows the list). We considered the NELHA station as an example for the tests; the results of these analyses are shown in Figure 5. For the NELHA station, the time-lagged versions exhibit degrading correlation over time. The PACF, on the other hand, indicates that the current GHI value is primarily linked to two highly correlated past values, with smaller correlations extending up to five past values, which correspond to a time lag of up to 3 h. A similar observation holds for the ULL and HSU stations, where the lagged versions of GHI are mainly correlated with the values from the last 30 min. This raises the question of how many historical shifts can influence the final results. As detailed in the previous analysis, a notable relationship exists between the shifted and current values of GHI, which is particularly significant when constructing data windows for prediction models. Three main machine learning models—RandomForest (RF), CatBoost, and Multi-Layer Perceptron (MLP)—were employed to analyze the impact of these shifted values on GHI prediction; they were selected for their demonstrated efficacy in previous research on GHI prediction [4]. A thorough analysis was conducted for at least eight past shifts, with the results illustrated in Figure 6 and Figure 7 for 30 min-ahead, 1 h-ahead, and 2 h-ahead forecasting. The influence of the shift on the variations in R2 and RMSE was analyzed for all stations across 8 shifts. For R2, with a 30 min shift, all stations exhibit consistent results across the 8 shifts, except NWTC and, to some extent, ULL. However, with larger shift intervals, such as 1 h and 2 h, the results become unstable, except for the UNLV station, which remains stable; this could indicate that, regardless of the shift considered, the UNLV station’s model performance will remain stable. For RMSE, stable performance is observed for a 30 min shift, except for the NWTC station; for 1 h and 2 h shifts, performance is unstable for all stations, with minimal instability for UNLV. It can be concluded that shift changes impact the R2 and RMSE values for all stations except UNLV, where the change is minimal yet present.
- Which features hold significant impact as predictors? In alignment with the previous procedure, a tree-averaged importance technique was applied to the three stations NELHA, HSU, and ULL to identify the features that lead to the best outcomes within the regressors. For the NELHA site, considering only the 10 most important features, the lagged versions of GHI up to 30 min in the past (from GHI at T−1 to GHI at T−30 min) and the accumulated global irradiance up to a 30 min lag, and up to three lags, were the most impactful features for predicting GHI at T+30 min. Other significant factors included two lagged versions of the accumulated irradiance and the three-to-five lagged versions of GHI. For the ULL site, the important features were limited to the lagged version of GHI up to 30 min in the past and, to a lesser extent, the second lagged versions up to 1 h earlier; other values were mostly insignificant. This pattern extended to the HSU site, where only two features were crucial: the lagged version of GHI up to 30 min and the lagged version of the zenith angle up to 30 min. These observations indicate a common consensus across all sites: the most important predictor is the first lagged version of GHI. All these results can be seen in Figure 8.
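A minimal sketch of the ACF/PACF analysis referenced above, using statsmodels on a 30 min resampled GHI series; `station_df` is a hypothetical per-station DataFrame with a UTC DatetimeIndex:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Resample GHI to 30 min so that lag k corresponds to k x 30 min.
ghi = station_df["GHI"].resample("30min").mean().dropna()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(ghi, lags=8, ax=ax1, title="GHI ACF (30 min lags)")
plot_pacf(ghi, lags=8, ax=ax2, title="GHI PACF (30 min lags)")
plt.show()
```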
3. Results and Discussion
- Results from each station: NELHA, ULL, HSU, UNLV, NWTC, and RAZON;
- Ensemble learners in different stations;
- Hourly GHI prediction and link with time-lagged versions;
- Sensitivity analysis through time shifting;
- UNLV use case.
3.1. Results from Each Station
- NELHA: This station is characterized by 19 features, several of which were found to be correlated with the target variable, as shown earlier. Moreover, correlations between features were studied with consideration of past time shifts, visualized in the correlation heatmap in Figure 9a. The correlations exceed 90% for the shifted GHI values with the global PAR, and the shifted HST values also demonstrate strong correlations, reaching 88%. This indicates that a feature model could be built around these shifted values. The results of the models trained on this station are depicted in Table 6. The best results are obtained with the GradientBoosting regressor, achieving an R2 of 0.96 and an RMSE of 63.61, followed by the MLP and the RandomForest regressor. Overall, all models performed well except the decision tree. The residual analysis at the NELHA station in Figure 10a indicates that the errors are centered around 0, predominantly ranging between −200 and 200, although they occasionally reach values between −400 and 400. Points outside these intervals suggest the presence of outliers or potential overfitting of the model.
- ULL: The correlation heatmap in Figure 9b and the performance results in Table 6 provide a comprehensive view of the relationships between time-shifted irradiance measurements and their impact on model performance. The high correlation between ghi_t1_30min and ghi_t2_30min (0.96) indicates strong temporal autocorrelation, also reflected in significant correlations with features such as dni_t1_30min (0.86) and ghi_t3_30min (0.91). This suggests that GHI measurements are closely related to direct normal irradiance (DNI) over short time intervals. Furthermore, the moderate correlation between dni_t1_30min and dhi_t1_30min (0.70) highlights the partial dependency between direct and diffuse horizontal irradiance (DHI). The models’ performance metrics, with R2 values ranging from 0.89 to 0.95, demonstrate that models like the MLPRegressor and CatBoostRegressor capture these intricate relationships effectively. The low RMSE and MAE values further underscore the models’ accuracy in predicting irradiance values, with the MLPRegressor showing the best overall performance. This indicates that leveraging these correlations can significantly enhance predictive modeling in solar irradiance forecasting. Residual analysis at the ULL station in Figure 10b demonstrates generally good performance by the MLP regressor: the errors are centered around 0, with most falling within [−200, 200], although some data points exhibit errors reaching up to 600.
- HSU: In this case, the features directly associated with GHI—namely, DNI, DHI, and GHI itself—were excluded, and the remaining 12 features were retained due to their high correlation with the target variable, as presented in Figure 9c. The correlation heatmap highlights significant relationships among features over 30 min intervals. Notably, the correlation between ghi_t1_30min and ghi_t2_30min is high at 0.97, indicating strong temporal autocorrelation in GHI measurements. Moreover, ghi_t1_30min shows strong correlations with CR510 Battery [VDC]_t1_30min (0.82) and ghi_t3_30min (0.92), suggesting consistent patterns in GHI over time. Zenith Angle [degrees]_t1_30min has a strong negative correlation with ghi_t1_30min and ghi_t2_30min, indicating that as the sun ascends (and the zenith angle decreases), the GHI increases. Furthermore, Azimuth Angle [degrees]_t6_30min correlates with other features, such as Zenith Angle [degrees]_t1_30min at 0.47 and Zenith Angle [degrees]_t-2_30min at 0.99, reflecting the sun’s position and its impact on irradiance. Lastly, the high correlation of 0.89 between Direct Normal (calc) [W/m2]_t1_30min and Diffuse Horiz (shadowband) [W/m2]_t1_30min indicates a strong relationship between these two types of irradiance. The model performance results in Table 6 show that CatBoostRegressor and other ensemble methods such as GradientBoostingRegressor and RandomForestRegressor achieve high R2 values (approximately 0.97), indicating strong predictive capability, together with low RMSE and MAE values. The median absolute error (median AE) is consistently low, particularly for RandomForestRegressor and ExtraTreesRegressor, suggesting fewer large prediction errors. Visualizing the CatBoost residuals for predicting GHI in the next 30 min across the time axis in Figure 10c, the errors are generally within ±200, with only a few outliers exceeding this range, indicating that the model performs well over the entire three-year dataset. This error margin narrows to less than ±100 at certain time points, such as between January 2022 and March 2022.
- UNLV: A total of 20 features exist at this station, representing the GHI variations, the ambient conditions, and the performance monitoring of the solar panels. The time-shifted values revealed a strong correlation between global UVA and UVB for up to 2 h, as depicted in Figure 9d. Similarly, shifted GHI and DNI values showed high correlation for up to 1 h. Subsequently, tree-based algorithms were trained to predict GHI values for the next 30 min. The performance comparison in Table 6 shows that all tested algorithms achieve high R2 values, indicating strong predictive accuracy. RF performs slightly better, with the highest R2 of 0.9814 and the lowest MAE of 14.67, reflecting minimal average prediction error. CatBoostRegressor follows closely, with similar accuracy but slightly higher RMSE and MAE. MLPRegressor and GradientBoostingRegressor also perform well, though with higher RMSE and MAE than RF. ET is comparable to RF in terms of R2 and error metrics. DecisionTreeRegressor, with a lower R2 and higher RMSE and MAE, still shows reasonable though less accurate performance. In this comparison, RandomForestRegressor emerges as the most effective model.
- RAZON: The RAZON station is characterized by 15 features, including the GHI, DNI, DHI, ambient conditions, solar PV temperature, azimuth angle, and battery operating conditions. The training process, as for all other stations, consisted of studying the correlation with lagged versions and employing the most influential features as model inputs. Figure 9e provides the feature correlation study, while Table 6 lists the obtained results. The correlation heatmap for RAZON over 30 min intervals reveals significant relationships among the variables, particularly between the GHI and DNI across different time shifts (T−1, T−2, T−3, …). Notably, the GHI values at T−1 and T−2 have very high positive correlations with each other (0.95) and with DNI at the corresponding time shifts. Conversely, the azimuth and zenith angles exhibit lower correlations with the irradiance variables, indicating less dependency: while the angles provide important contextual data, the irradiance measures are more interdependent. This analysis underscores the critical role of temporal shifts in understanding irradiance patterns and their predictive modeling. The performance analysis in Table 6 reveals that the GradientBoostingRegressor and RandomForestRegressor models exhibit the highest R2 values (0.8903 and 0.8897, respectively), indicating strong predictive accuracy. RandomForestRegressor slightly outperforms the other models in MAE, with 28.47 versus GradientBoostingRegressor’s 30.93, suggesting lower average prediction errors. ExtraTreesRegressor performs similarly to RandomForestRegressor but with slightly higher RMSE and MAE. MLPRegressor has a significantly higher MAE (56.59) and RMSE (108.05), indicating poorer performance. CatBoostRegressor also performs well, though not as effectively as GradientBoosting or RandomForest, with an R2 of 0.8650 and higher RMSE and MAE. DecisionTreeRegressor shows the lowest performance, with the lowest R2 of 0.8238 and the highest RMSE of 131.79, indicating higher prediction errors and less accuracy than the other models. In this comparison, RandomForestRegressor and GradientBoostingRegressor are the most effective models.
- NWTC: The NWTC station is characterized by more than 20 features, from the ambient and operating conditions to the GHI, DHI, and DNI values. As shown in Figure 9f, all features exhibit a strong correlation with values shifted by 30 min, except for the monthly series, which displays weak correlations. This is expected, given that the PACF and ACF analyses captured dependencies in lagged values up to 30 min. The model performance in Table 6 reveals that GradientBoosting stands out with the highest R2 value of 0.9942, demonstrating superior predictive accuracy and the best fit to the data; it also has the lowest RMSE of 71.26 and a low MAE of 34.00, indicating minimal average prediction error. In contrast, the MLP, with the lowest R2 of 0.9744 and the highest RMSE of 149.92, shows the weakest performance, reflecting higher prediction errors and less accuracy. RandomForest and CatBoost also perform well, with high R2 values of 0.9893 and 0.9868, respectively, and relatively low error metrics, though slightly less favorable than GradientBoosting. Notably, ExtraTrees exhibits higher errors than these top models, while DecisionTree, though strong, has a slightly higher RMSE and MAE than the top performers. Residual analysis at the NWTC station in Figure 10d demonstrates generally good performance by the GradientBoosting regressor: the errors are centered around 0, with most falling within [−250, 250], although some data points exhibit errors reaching up to 982.
3.2. Ensemble Learners in Different Stations
3.3. Hourly GHI Prediction and Link with Time-Lagged Versions
3.4. UNLV Station Use Case
3.5. Benchmarking with ARIMA and Individual Non-Shifted Features
3.6. Feature Impacts with SHAP
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Tuned Hyperparameters for Benchmark Models
Model | Best Hyperparameters |
---|---|
RandomForest Regressor | n_estimators: 10–200; max_depth: 10–18 |
GradientBoosting Regressor | n_estimators: 10–150; learning_rate: 0.05; max_depth: 10 |
Decision Tree Regressor | max_depth: 3–12 |
MLP Regressor | hidden_layer_sizes: (100, 100); learning_rate_init: 0.001 |
CatBoost Regressor | learning_rate: 0.03; depth: 6 |
LightGBM Regressor | n_estimators: 50–200; learning_rate: 0.02; num_leaves: 10–40 |
XGBoost Regressor | n_estimators: 50–180; learning_rate: 0.07; max_depth: 9 |
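The paper does not state the search procedure behind these values; a plausible sketch, assuming a grid search over the ranges above with a causal time-series split, is shown for one model (`X_train`/`y_train` denote the chronological training split):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Grid taken from Table A1 (n_estimators 10-200, max_depth 10-18).
grid = {"n_estimators": [10, 50, 100, 200],
        "max_depth": list(range(10, 19))}
search = GridSearchCV(RandomForestRegressor(random_state=42), grid,
                      cv=TimeSeriesSplit(n_splits=5),
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)  # assumes the chronological 80% training split
print(search.best_params_)
```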
References
- Wang, F.; Harindintwali, J.D.; Wei, K.; Shan, Y.; Mi, Z.; Costello, M.J.; Grunwald, S.; Feng, Z.; Wang, F.; Guo, Y.; et al. Climate change: Strategies for mitigation and adaptation. Innov. Geosci. 2023, 1, 100015-61. [Google Scholar] [CrossRef]
- Agreement, P. Paris agreement. In Report of the Conference of the Parties to the United Nations Framework Convention on Climate Change (21st Session, 2015: Paris); Retrieved December; HeinOnline: Getzville, NY, USA, 2015; Volume 4, p. 2. [Google Scholar]
- Diagne, M.; David, M.; Lauret, P.; Boland, J.; Schmutz, N. Review of solar irradiance forecasting methods and a proposition for small-scale insular grids. Renew. Sustain. Energy Rev. 2013, 27, 65–76. [Google Scholar] [CrossRef]
- Allal, Z.; Noura, H.N.; Chahine, K. Machine Learning Algorithms for Solar Irradiance Prediction: A Recent Comparative Study. e-Prime-Adv. Electr. Eng. Electron. Energy 2024, 7, 100453. [Google Scholar] [CrossRef]
- Abdel-Nasser, M.; Mahmoud, K.; Lehtonen, M. Reliable solar irradiance forecasting approach based on choquet integral and deep LSTMs. IEEE Trans. Ind. Inform. 2020, 17, 1873–1881. [Google Scholar] [CrossRef]
- Wang, F.; Mi, Z.; Su, S.; Zhao, H. Short-term solar irradiance forecasting model based on artificial neural network using statistical feature parameters. Energies 2012, 5, 1355–1370. [Google Scholar] [CrossRef]
- Ramadhan, R.A.; Heatubun, Y.R.; Tan, S.F.; Lee, H.J. Comparison of physical and machine learning models for estimating solar irradiance and photovoltaic power. Renew. Energy 2021, 178, 1006–1019. [Google Scholar] [CrossRef]
- Haider, S.A.; Sajid, M.; Sajid, H.; Uddin, E.; Ayaz, Y. Deep learning and statistical methods for short-and long-term solar irradiance forecasting for Islamabad. Renew. Energy 2022, 198, 51–60. [Google Scholar] [CrossRef]
- Miller, S.D.; Rogers, M.A.; Haynes, J.M.; Sengupta, M.; Heidinger, A.K. Short-term solar irradiance forecasting via satellite/model coupling. Sol. Energy 2018, 168, 102–117. [Google Scholar] [CrossRef]
- Verbois, H.; Saint-Drenan, Y.M.; Thiery, A.; Blanc, P. Statistical learning for NWP post-processing: A benchmark for solar irradiance forecasting. Sol. Energy 2022, 238, 132–149. [Google Scholar] [CrossRef]
- Zambrano, A.F.; Giraldo, L.F. Solar irradiance forecasting models without on-site training measurements. Renew. Energy 2020, 152, 557–566. [Google Scholar] [CrossRef]
- Wang, F.; Xuan, Z.; Zhen, Z.; Li, Y.; Li, K.; Zhao, L.; Shafie-khah, M.; Catalão, J.P. A minutely solar irradiance forecasting method based on real-time sky image-irradiance mapping model. Energy Convers. Manag. 2020, 220, 113075. [Google Scholar] [CrossRef]
- Rohani, A.; Taki, M.; Abdollahpour, M. A novel soft computing model (Gaussian process regression with K-fold cross validation) for daily and monthly solar radiation forecasting (Part: I). Renew. Energy 2018, 115, 411–422. [Google Scholar] [CrossRef]
- Li, J.; Ward, J.K.; Tong, J.; Collins, L.; Platt, G. Machine learning for solar irradiance forecasting of photovoltaic system. Renew. Energy 2016, 90, 542–553. [Google Scholar] [CrossRef]
- Alzahrani, A.; Shamsi, P.; Dagli, C.; Ferdowsi, M. Solar irradiance forecasting using deep neural networks. Procedia Comput. Sci. 2017, 114, 304–313. [Google Scholar] [CrossRef]
- Yahya, N.W.; Prasad, D.R.; Sathiyamoorthy, D.; Pendyala, R. Prospects and roadmaps for harvesting solar thermal power in tropical Brunei Darussalam. Int. J. Glob. Energy Issues 2021, 43, 616–642. [Google Scholar] [CrossRef]
- Chu, Y.; Pedro, H.T.; Li, M.; Coimbra, C.F. Real-time forecasting of solar irradiance ramps with smart image processing. Sol. Energy 2015, 114, 91–104. [Google Scholar] [CrossRef]
- Pazikadin, A.R.; Rifai, D.; Ali, K.; Malik, M.Z.; Abdalla, A.N.; Faraj, M.A. Solar irradiance measurement instrumentation and power solar generation forecasting based on Artificial Neural Networks (ANN): A review of five years research trend. Sci. Total Environ. 2020, 715, 136848. [Google Scholar] [CrossRef]
- Kumari, P.; Toshniwal, D. Extreme gradient boosting and deep neural network based ensemble learning approach to forecast hourly solar irradiance. J. Clean. Prod. 2021, 279, 123285. [Google Scholar] [CrossRef]
- Bamisile, O.; Oluwasanmi, A.; Ejiyi, C.; Yimen, N.; Obiora, S.; Huang, Q. Comparison of machine learning and deep learning algorithms for hourly global/diffuse solar radiation predictions. Int. J. Energy Res. 2022, 46, 10052–10073. [Google Scholar] [CrossRef]
- Gürel, A.E.; Ağbulut, Ü.; Biçen, Y. Assessment of machine learning, time series, response surface methodology and empirical models in prediction of global solar radiation. J. Clean. Prod. 2020, 277, 122353. [Google Scholar] [CrossRef]
- Ikram, R.M.A.; Dai, H.L.; Ewees, A.A.; Shiri, J.; Kisi, O.; Zounemat-Kermani, M. Application of improved version of multi verse optimizer algorithm for modeling solar radiation. Energy Rep. 2022, 8, 12063–12080. [Google Scholar] [CrossRef]
- Husein, M.; Chung, I.Y. Day-ahead solar irradiance forecasting for microgrids using a long short-term memory recurrent neural network: A deep learning approach. Energies 2019, 12, 1856. [Google Scholar] [CrossRef]
- Sansine, V.; Ortega, P.; Hissel, D.; Hopuare, M. Solar Irradiance Probabilistic Forecasting Using Machine Learning, Metaheuristic Models and Numerical Weather Predictions. Sustainability 2022, 14, 15260. [Google Scholar] [CrossRef]
- Measurement and Instrumentation Data Center (MIDC). MIDC Raw Data API Documentation. 2024. Available online: https://midcdmz.nrel.gov/apps/data_api_doc.pl?_idtextlist (accessed on 19 July 2024).
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. 2011. Available online: https://scikit-learn.org/stable/ (accessed on 20 July 2024).
- Nemeth, M.; Borkin, D.; Michalconok, G. The comparison of machine-learning methods XGBoost and LightGBM to predict energy development. In Proceedings of the Computational Statistics and Mathematical Modeling Methods in Intelligent Systems: Proceedings of 3rd Computational Methods in Systems and Software 2019; Springer: Cham, Switzerland, 2019; Volume 23, pp. 208–215. [Google Scholar]
- Dasi, H.; Ying, Z.; Yang, B. Predicting the consumed heating energy at residential buildings using a combination of categorical boosting (CatBoost) and Meta heuristics algorithms. J. Build. Eng. 2023, 71, 106584. [Google Scholar] [CrossRef]
- Ahmad, T.; Manzoor, S.; Zhang, D. Forecasting high penetration of solar and wind power in the smart grid environment using robust ensemble learning approach for large-dimensional data. Sustain. Cities Soc. 2021, 75, 103269. [Google Scholar] [CrossRef]
- Lundberg, S.M.; Lee, S.I. SHAP (SHapley Additive exPlanations). 2023. Available online: https://shap.readthedocs.io/en/latest/ (accessed on 30 July 2024).
Station ID (Link: https://midcdmz.nrel.gov/apps/data_api_doc.pl?_idtextlist (accessed on 23 July 2025)) | Station Name |
---|---|
RCS | ARM Radiometer Characterization System (RCS) |
BS | Bluefield State College |
HSU | Cal Poly Humboldt Solar Radiation Monitoring System (SoRMS) |
EC | Elizabeth City State University |
LRSS | Lowry Range Solar Station (RSR) |
AODSRRLBC | NREL Solar Radiation Research Laboratory (AOD SR-CEReS L2) |
AODSRRL0S | NREL Solar Radiation Research Laboratory (AOD SkyRad L0) |
AODSRRL1S | NREL Solar Radiation Research Laboratory (AOD SkyRad L1.1) |
SS1 | Sun Spot One—San Luis Valley (RSR) |
USVIBOVB | US Virgin Islands Bovoni 2 |
USVILONA | US Virgin Islands Longford |
UFL | University of Florida |
ULL | University of Louisiana at Lafayette |
UNLV | University of Nevada—Las Vegas |
PWVUO | University of Oregon (GPS-based PWV) |
UOSMRL | University of Oregon (SRML) |
UTPASRL | University of Texas Rio Grande Valley Solar Radiation Lab |
PVSUOSMP | UofO PV Resource (Samples) |
PVSUO | UofO PV Resource |
XECS | Xcel Energy Comanche Station (RSR) |
Metric | Equation | Optimal Value | Advantages |
---|---|---|---|
R2 | $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | 1 | Indicates the proportion of variance explained by the model, making it easy to interpret. |
RMSE | $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ | 0 | Sensitive to large errors, providing a clear measure of model accuracy. |
MAE | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | 0 | Easy to understand and less sensitive to outliers compared to RMSE. |
MAPE | $\frac{100}{n}\sum_i \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$ | 0 | Expresses errors as a percentage, making it easier to interpret the relative size of errors. |
Median AE | $\mathrm{median}(\lvert y_i - \hat{y}_i \rvert)$ | 0 | Robust to outliers, providing a more resilient measure of central tendency. |
D2 | $1 - \frac{D(y, \hat{y})}{D(y, \bar{y})}$ (deviance-based generalization of R2) | 1 | Similar to R2 but more suitable for specific types of regression models, offering additional flexibility. |
Parameter | NELHA | ULL | HSU |
---|---|---|---|
Air temperature | ✓(°C) | ✓(°C) | |
Avg wind direction (stdev) | ✓(°) | ||
Avg wind direction | ✓(° from N) | ✓(° from N) | |
Avg wind speed | ✓(m/s) | ✓(m/s) | |
CR1000 battery | ✓(VDC) | ✓(VDC) | |
CR1000 temp | ✓(°C) | ✓(°C) | |
Dew point temp | ✓(°C) | ||
Global horizontal | ✓(W/m2) | ✓(W/m2) | ✓(W/m2) |
Global PAR | ✓(µmol/s/m2) | ||
Global UV | ✓(W/m2) | ||
Global UV-PFD | ✓(µmol/s/m2) | ||
Peak wind speed | ✓(m/s) | ✓(m/s) | |
Precipitation (accumulated) | ✓(mm) | ✓(mm) | |
Rel humidity | ✓(%) | ✓(%) | |
Station pressure | ✓(mBar) | ✓(mBar) | |
Diffuse (stdev) | ✓(W/m2) | ||
Diffuse horizontal | ✓(W/m2) | ✓(W/m2) | |
Direct (stdev) | ✓(W/m2) | ||
Direct normal | ✓(W/m2) | ||
Global POA (stdev) | ✓(W/m2) | ||
Global POA | ✓(W/m2) | ||
Wind direction (stdev) | ✓(°) | ||
Airmass | ✓ | ||
Azimuth angle | ✓(°) | ||
CR510 battery | ✓(VDC) | ||
CR510 temp | ✓(°C) | ||
Diffuse horiz (band corr) | ✓(W/m2) | ||
Diffuse horiz (shadowband) | ✓(W/m2) | ||
Direct normal (calc) | ✓(W/m2) | ||
Zenith angle | ✓(°) |
Site | Model | R2 | RMSE | MAE | Median AE | D2 |
---|---|---|---|---|---|---|
ULL | MLP | 0.95 | 62.09 | 26.63 | 3.21 | 0.86 |
 | CatBoost | 0.95 | 63.93 | 26.99 | 2.77 | 0.86 |
 | GradientBoosting | 0.95 | 64.33 | 28.60 | 4.19 | 0.85 |
 | RandomForest | 0.95 | 64.68 | 26.64 | 1.42 | 0.86 |
 | ExtraTrees | 0.95 | 65.58 | 27.07 | 1.36 | 0.86 |
 | Decision Tree | 0.89 | 93.62 | 38.56 | 1.64 | 0.80 |
HSU | CatBoost | 0.97 | 45.87 | 18.94 | 1.56 | 0.88 |
 | GradientBoosting | 0.97 | 46.60 | 20.35 | 3.92 | 0.87 |
 | RandomForest | 0.97 | 46.61 | 18.97 | 1.17 | 0.88 |
 | ExtraTrees | 0.97 | 46.64 | 18.94 | 1.12 | 0.88 |
 | MLP | 0.97 | 47.22 | 21.13 | 3.29 | 0.87 |
 | Decision Tree | 0.93 | 66.30 | 26.95 | 1.67 | 0.83 |
UNLV | MLP | 0.98 | 42.82 | 15.46 | 1.92 | 0.93 |
 | CatBoost | 0.98 | 43.62 | 15.35 | 1.60 | 0.94 |
 | RandomForest | 0.98 | 43.66 | 14.67 | 0.81 | 0.94 |
 | GradientBoosting | 0.98 | 43.88 | 17.17 | 4.07 | 0.93 |
 | ExtraTrees | 0.98 | 43.98 | 14.67 | 0.77 | 0.94 |
 | Decision Tree | 0.96 | 60.30 | 20.23 | 1.07 | 0.91 |
RAZON | GradientBoosting | 0.89 | 103.98 | 30.93 | 6.35 | 0.84 |
 | RandomForest | 0.89 | 104.26 | 28.47 | 0.60 | 0.85 |
 | ExtraTrees | 0.88 | 106.78 | 28.94 | 0.59 | 0.85 |
 | MLP | 0.88 | 108.05 | 56.59 | 33.04 | 0.71 |
 | CatBoost | 0.87 | 115.36 | 30.37 | 2.73 | 0.84 |
 | Decision Tree | 0.82 | 131.79 | 38.42 | 0.67 | 0.80 |
NWTC | GradientBoosting | 0.99 | 71.26 | 34.00 | 11.85 | 0.89 |
 | Decision Tree | 0.99 | 95.11 | 38.07 | 1.77 | 0.87 |
 | RandomForest | 0.99 | 96.66 | 29.62 | 1.80 | 0.90 |
 | CatBoost | 0.99 | 107.49 | 32.93 | 5.14 | 0.89 |
 | ExtraTrees | 0.98 | 126.01 | 31.28 | 1.74 | 0.90 |
 | MLP | 0.97 | 149.92 | 103.40 | 64.56 | 0.66 |
Station | Model | R2 | RMSE | MAE | MAPE | Median AE | D2 | Composition |
---|---|---|---|---|---|---|---|---|
NELHA | Stacking | 0.96 | 63.61 | 28.25 | 5.69 | 2.58 | 0.88 | GradientBoosting, MLP |
ULL | Stacking | 0.95 | 61.87 | 25.96 | 3.83 | 1.88 | 0.87 | MLP, CatBoost, GradientBoosting |
HSU | Stacking | 0.97 | 45.59 | 18.46 | 0.64 | 1.18 | 0.89 | CatBoost, RandomForest |
UNLV | Stacking | 0.97 | 60.88 | 26.90 | 4.88 | 2.4 | 0.89 | GradientBoosting, MLP |
NWTC | Stacking | 0.994 | 69.888 | 28.827 | 0.523 | 2.942 | 0.904 | GradientBoosting, RandomForest |
RAZON | Stacking | 0.90 | 98.940 | 28.602 | 1.04174 | 1.401 | 0.853 | GradientBoosting, ExtraTrees |
Shift | Station | Model | R2 | RMSE | MAE | Median AE | D2 |
---|---|---|---|---|---|---|---|
30 min | NELHA | GradientBoosting | 0.96 | 63.61 | 28.88 | 3.17 | 0.88 |
30 min | ULL | MLP | 0.95 | 62.09 | 26.63 | 3.21 | 0.86 |
30 min | HSU | CatBoost | 0.97 | 45.87 | 18.94 | 1.56 | 0.88 |
30 min | UNLV | MLP | 0.98 | 42.82 | 15.46 | 1.92 | 0.93 |
30 min | NWTC | GradientBoosting | 0.99 | 71.26 | 34.00 | 11.85 | 0.89 |
30 min | RAZON | GradientBoosting | 0.89 | 103.98 | 30.93 | 6.35 | 0.84 |
1 h | NELHA | MLP | 0.89 | 113.19 | 31.97 | 4.57 | 0.87 |
1 h | ULL | MLP | 0.95 | 65.01 | 31.12 | 5.72 | 0.84 |
1 h | HSU | CatBoost | 0.96 | 51.11 | 22.40 | 2.38 | 0.86 |
1 h | UNLV | MLP | 0.98 | 46.81 | 19.71 | 4.37 | 0.92 |
1 h | NWTC | ExtraTrees | 0.98 | 113.98 | 35.23 | 4.84 | 0.88 |
1 h | RAZON | GradientBoosting | 0.94 | 74.10 | 33.63 | 6.22 | 0.83 |